From fgao at openjdk.org Wed Mar 1 02:29:06 2023 From: fgao at openjdk.org (Fei Gao) Date: Wed, 1 Mar 2023 02:29:06 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v9] In-Reply-To: References: <8lyMxPlTmzLDuwFLvhla8AAL6sVxEjigf4LaDlsAVWg=.4bfb5428-4b11-4c65-8540-503cf9e595f1@github.com> <_eDa4Xjs3HIxm2u4bUOWmmZL6y2DjQGd9kVyrUCTSD0=.d56d4533-bf46-4900-8c7c-6e22d247ff22@github.com> <9YDIdADEPbY9f6PT6xIQMs-Lc4SsDvVp4ua_JOdpAOY=.3f0a4b8e-05ab-4320-847e-223e97099ec9@github.com> Message-ID: On Tue, 28 Feb 2023 15:27:07 GMT, Emanuel Peter wrote: >> @jatin-bhateja I now have a first version out of the `test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java`. It seems to work for `sse4.1 ... avx512`. I'm now testing it for `asimd`. >> And then I will proceed to add the features for `-XX:+AlignVector`, with the modulo check. > > Update: I cannot use `SuperWordMaxVectorSize`, because it is regulated down on x86 to be at most `MaxVectorSize`. But on `aarch64` the flag `SuperWordMaxVectorSize` does not seem to be adjusted. @fg1417 do you think this is correct / on purpose? Maybe this is just an unfortunate but harmless inconsistency. I guess in `SuperWord::max_vector_size` we first get the info from `Matcher::max_vector_size` (based on `MaxVectorSize`), and then upper bound that based on `SuperWordMaxVectorSize`. > > TLDR: I am using `MaxVectorSize` instead of `SuperWordMaxVectorSize` now. Hi @eme64, see https://github.com/openjdk/jdk/pull/8877. Before that, we use `MaxVectorSize` for all platforms. `SuperWordMaxVectorSize` is only used to fix the performance issue on x86. The option is set as `64` by default, which is fine for current aarch64 hardware, but SVE architecture supports more than 512 bits. I believe `SuperWordMaxVectorSize` is just a temporary solution and we expect a more complete fix. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From jsjolen at openjdk.org Wed Mar 1 09:06:50 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Wed, 1 Mar 2023 09:06:50 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v3] In-Reply-To: References: Message-ID: <6UYV4UEowi3T3oMiBn4srMGOZIaf_iwcj3_hlDx--FI=.1ead71f0-ce00-4826-ac11-64e4e7500192@github.com> > Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we > need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. > > Here are some typical things to look out for: > > No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). > Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. > nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. > > An example of this: > > > // This function returns null > void* ret_null(); > // This function returns true if *x == nullptr > bool is_nullptr(void** x); > > > Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. > > Thanks! 
Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: Check for null string explicitly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12187/files - new: https://git.openjdk.org/jdk/pull/12187/files/73d54744..a440058b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12187&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12187&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12187.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12187/head:pull/12187 PR: https://git.openjdk.org/jdk/pull/12187 From jsjolen at openjdk.org Wed Mar 1 09:15:14 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Wed, 1 Mar 2023 09:15:14 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v3] In-Reply-To: <6UYV4UEowi3T3oMiBn4srMGOZIaf_iwcj3_hlDx--FI=.1ead71f0-ce00-4826-ac11-64e4e7500192@github.com> References: <6UYV4UEowi3T3oMiBn4srMGOZIaf_iwcj3_hlDx--FI=.1ead71f0-ce00-4826-ac11-64e4e7500192@github.com> Message-ID: On Wed, 1 Mar 2023 09:06:50 GMT, Johan Sj?len wrote: >> Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we >> need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. >> >> Here are some typical things to look out for: >> >> No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). >> Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. >> nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. >> >> An example of this: >> >> >> // This function returns null >> void* ret_null(); >> // This function returns true if *x == nullptr >> bool is_nullptr(void** x); >> >> >> Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. >> >> Thanks! > > Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: > > Check for null string explicitly Changed an assert after Clang complained (Mac builds failed). ------------- PR: https://git.openjdk.org/jdk/pull/12187 From roland at openjdk.org Wed Mar 1 09:36:16 2023 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 1 Mar 2023 09:36:16 GMT Subject: RFR: 8301630: C2: 8297933 broke type speculation in some cases [v2] In-Reply-To: References: Message-ID: On Thu, 23 Feb 2023 10:58:40 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> type system fix & test > > The update looks good to me, too! @chhagedorn thanks for the review. ------------- PR: https://git.openjdk.org/jdk/pull/12368 From roland at openjdk.org Wed Mar 1 09:40:22 2023 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 1 Mar 2023 09:40:22 GMT Subject: Integrated: 8301630: C2: 8297933 broke type speculation in some cases In-Reply-To: References: Message-ID: On Wed, 1 Feb 2023 16:55:01 GMT, Roland Westrelin wrote: > With 8297933, a TypeAryPtr is a lot more likely to have a null _klass > and that breaks TypePtr::speculative_type(). This pull request has now been integrated. 
Changeset: 6b07243f Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/6b07243f5671f148166f027796f620bad9b38f73 Stats: 175 lines in 3 files changed: 173 ins; 0 del; 2 mod 8301630: C2: 8297933 broke type speculation in some cases Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12368 From epeter at openjdk.org Wed Mar 1 09:46:13 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 1 Mar 2023 09:46:13 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs In-Reply-To: <86PauwnM1BpvfJVJW1kyP6g3sVeirHKrMIry-kl6Ins=.50f25d11-47d2-4b33-8008-7f36a874b06e@github.com> References: <86PauwnM1BpvfJVJW1kyP6g3sVeirHKrMIry-kl6Ins=.50f25d11-47d2-4b33-8008-7f36a874b06e@github.com> Message-ID: On Mon, 13 Feb 2023 01:43:49 GMT, Fei Gao wrote: >>> @fg1417 Do you have the possibility to test on arm32? >> >> Sure. I'll do the testing with a 32-bit docker container on a 64-bit host. > >> > @fg1417 Do you have the possibility to test on arm32? >> >> Sure. I'll do the testing with a 32-bit docker container on a 64-bit host. > > The testing for tier 1 - 3 and jcstress looks good. No new failures on arm32. Thanks. I have been discussing with @fg1417 over emails (thanks for the conversation). @fg1417 voiced concern about my change here: > Disallow extend_packlist from adding MemNodes back in. Because if we have rejected some memops, we do not want them to be added back in later. This refers to: https://github.com/openjdk/jdk/blob/8349906d63a5746383752091904bd1b1e75c83ae/src/hotspot/share/opto/superword.cpp#L1562-L1564 https://github.com/openjdk/jdk/blob/8349906d63a5746383752091904bd1b1e75c83ae/src/hotspot/share/opto/superword.cpp#L1603-L1612 Before my patch, the condition was only demanding the extension to be within the block, and did not exclude memops (eg. it was `if (!in_bb(t1) || !in_bb(t2)) {`) Before my patch, this means we may be adding back (resurrecting) memops in `extend_packlist` that were rejected in `find_adjacent_refs`. For example they were rejected for these reasons: **1: +AlignVector requires all packs to have vector_with <= that of best_align_to_mem_ref** `+AlignVector` requires that all packs have a `vector_width` smaller or equal of that of `best_align_to_mem_ref`. https://github.com/openjdk/jdk/blob/8349906d63a5746383752091904bd1b1e75c83ae/src/hotspot/share/opto/superword.cpp#L773-L781 The issue here is that if we align the main-loop based on `best_align_to_mem_ref`. https://github.com/openjdk/jdk/blob/8349906d63a5746383752091904bd1b1e75c83ae/src/hotspot/share/opto/superword.cpp#L2636-L2640 https://github.com/openjdk/jdk/blob/8349906d63a5746383752091904bd1b1e75c83ae/src/hotspot/share/opto/superword.cpp#L3998 If there is any other pack that now has a larger `vector_width` than that of the `best_align_to_mem_ref` we cannot ensure that it is properly aligned under `+AlignVector`. It seems that the current assumption of the VM is that `+AlignVector` requires all vector memory ops to be aligned to their vector size. Of course many CPU's only have a 4-byte or 8-byte alignment requirement. On those platforms, our current alignment analysis is much too coarse and prevents SuperWord in many instances, unnecessarily. 
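To make the constraint concrete with made-up numbers (illustration only, not taken from the linked code): if `best_align_to_mem_ref` is a pack of 4-byte elements with `vector_width` 16, aligning the main loop only guarantees that its address is a multiple of 16. A pack of 8-byte elements in the same loop has `vector_width` 32, and a multiple of 16 is not necessarily a multiple of 32, so the assumption that every vector memory op is aligned to its own vector width cannot be guaranteed for the wider pack under `+AlignVector`.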
One example that used to work under master, but not with my patch (**performance regression**), is this example (from `TestVectorizeTypeConversion.testConvI2D`): @Test @IR(counts = {IRNode.LOAD_VECTOR, ">0", IRNode.VECTOR_CAST_I2X, ">0", IRNode.STORE_VECTOR, ">0"}) private static void testConvI2D(double[] d, int[] a) { for(int i = 0; i < d.length; i++) { d[i] = (double) (a[i]); } } In this example, `best_align_to_mem_ref` is `StoreD`, which on some machines may have a smaller `vector_width` than the `LoadI`, and hence gets rejected. But under master, it was added back in during `extend_packlist`. **WIP, I will write more** ------------- PR: https://git.openjdk.org/jdk/pull/12350 From dnsimon at openjdk.org Wed Mar 1 10:48:04 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 1 Mar 2023 10:48:04 GMT Subject: RFR: 8303357: [JVMCI] thread is _thread_in_vm when committing JFR compilation event In-Reply-To: References: Message-ID: On Tue, 28 Feb 2023 16:23:04 GMT, Tom Rodriguez wrote: >> [JDK-8280844](https://bugs.openjdk.org/browse/JDK-8280844) added a native-to-VM thread transition when committing a JFR compilation event. This is not necessary if the thread is already `_thread_in_vm` at the commit point. This PR makes the transition conditional. > > Marked as reviewed by never (Reviewer). Thanks @tkrodriguez and @vnkozlov ! ------------- PR: https://git.openjdk.org/jdk/pull/12787 From dnsimon at openjdk.org Wed Mar 1 10:51:14 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 1 Mar 2023 10:51:14 GMT Subject: Integrated: 8303357: [JVMCI] thread is _thread_in_vm when committing JFR compilation event In-Reply-To: References: Message-ID: On Tue, 28 Feb 2023 14:25:50 GMT, Doug Simon wrote: > [JDK-8280844](https://bugs.openjdk.org/browse/JDK-8280844) added a native-to-VM thread transition when committing a JFR compilation event. This is not necessary if the thread is already `_thread_in_vm` at the commit point. This PR makes the transition conditional. This pull request has now been integrated. Changeset: 2451c5a4 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/2451c5a4620d5aec0ea9bc52fee5f3a54eb89d62 Stats: 10 lines in 1 file changed: 7 ins; 0 del; 3 mod 8303357: [JVMCI] thread is _thread_in_vm when committing JFR compilation event Reviewed-by: never, kvn ------------- PR: https://git.openjdk.org/jdk/pull/12787 From tholenstein at openjdk.org Wed Mar 1 11:50:48 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 1 Mar 2023 11:50:48 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally Message-ID: In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). - With this change the user can define a set of filters for each individual graph tab using the `--Custom--` profile - Further a filter profile can be created that represents a set of filters. This filter profile can then be selected in each graph tab individually. ### Custom profile Each tab has a `--Custom--` filter profile which is selected when opening a graph. Filters applied to the `--Custom--` profile are applied only to the currently selected tab. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. tabA When clicking on a different tab with a different `--Custom--` profile, the selected filters get updated accordingly.
tabB ### New profile The user can also create a new filter profile and give it a name. E.g. `My Filters` newProfile The `My Filters` profile is then globally available to other tabs as well selectProfile ### Filters for cloned tabs When the user clones a tab, the `--Custom--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened cloneTab ------------- Commit messages: - copyright year - initFiltersFromModel() - Fix bug when deleting a used filter setting - custom filter working - merge - customFilterChain - bold tab display name - filter per Tab working - further cleanup - cleanup FilterTopComponent Changes: https://git.openjdk.org/jdk/pull/12714/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8302644 Stats: 586 lines in 10 files changed: 169 ins; 320 del; 97 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From epeter at openjdk.org Wed Mar 1 12:05:48 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 1 Mar 2023 12:05:48 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: Message-ID: > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. 
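To make the distance-2 pattern in the examples linked above concrete, a minimal loop of that shape looks roughly like this (a hypothetical sketch, not the exact code of the linked tests):

```java
// Hypothetical sketch of a distance-2 cyclic dependency (not the linked test code).
// Scalar semantics repeat a[0] and a[1] through the whole array; a naive 4-lane
// vectorization instead shifts the array contents by 2, which is wrong.
static void runCyclicDistance2(float[] a) {
    for (int i = 2; i < a.length; i++) {
        a[i] = a[i - 2]; // the load at i depends on the store done two iterations earlier
    }
}
```

With vectors of four or more lanes, the loads and stores of one vector iteration overlap, so the packs are not independent and must not be scheduled together.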
> > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: TestDependencyOffsets.java: MulVL not supported on NEON / asimd. 
Replaced it with AddVL ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/8349906d..0cb67e5d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=13-14 Stats: 297 lines in 1 file changed: 0 ins; 0 del; 297 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Wed Mar 1 12:12:09 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 1 Mar 2023 12:12:09 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v3] In-Reply-To: References: Message-ID: <4Zzq82OGO_kHUpd5FnoueK5-ZtVI42E6lbwPLGta_Yw=.9f0c7ede-85f3-4ad3-ac41-ae7e920503a9@github.com> On Wed, 22 Feb 2023 08:14:52 GMT, Emanuel Peter wrote: >> Daniel Skantz has updated the pull request incrementally with one additional commit since the last revision: >> >> Address further review comments (edits) > > test/hotspot/jtreg/compiler/loopopts/superword/RedTest_long.java line 166: > >> 164: @IR(applyIf = {"SuperWordReductions", "false"}, >> 165: failOn = {IRNode.OR_REDUCTION_V}) >> 166: @IR(applyIfCPUFeature = {"avx2", "true"}, > > Add a comment why we only require `AVX2`. Add "negative" rule when not present. Do that everywhere where you require more than `sse4.1`. @danielogh Thanks for adding that! ------------- PR: https://git.openjdk.org/jdk/pull/12683 From epeter at openjdk.org Wed Mar 1 12:21:06 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 1 Mar 2023 12:21:06 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v3] In-Reply-To: References: Message-ID: On Tue, 28 Feb 2023 08:03:20 GMT, Daniel Skantz wrote: >> We are lifting some loopopts/superword tests to use the IR framework, and add IR annotations to check that vector reductions take place on x86_64. This can be useful to prevent issues such as JDK-8300865. >> >> Approach: lift the more general tests in loopopts/superword, mainly using matching rules in cpu/x86/x86.ad, but leave tests mostly unchanged otherwise. Some reductions are considered non-profitable (superword.cpp), so we might need to raise sse/avx value pre-conditions from what would be a strict reading of x86.ad (as noted by @eme64). >> >> Testing: Local testing (x86_64) using UseSSE={2,3,4}, UseAVX={0,1,2,3}. Tested running all jtreg compiler tests. Tier1-tier5 runs to my knowledge never showed any compiler-related regression in other tests as a result from this work. GHA. Validation: all tests fail if we put unreasonable counts for the respective reduction node, such as counts = {IRNode.ADD_REDUCTION_VI, ">= 10000000"}). >> >> Thanks @robcasloz and @eme64 for advice. >> >> Notes: ProdRed_Double does not vectorize (JDK-8300865). SumRed_Long does not vectorize on 32-bit, according to my reading of source, test on GHA and cross-compiled JDK on 32-bit Linux, so removed these platforms from @requires. Lifted the AbsNeg tests too but added no checks, as these are currently not run on x86_64. 
> > Daniel Skantz has updated the pull request incrementally with one additional commit since the last revision: > > Address further review comments (edits) test/hotspot/jtreg/compiler/loopopts/superword/SumRedAbsNeg_Double.java line 97: > 95: @Test > 96: @IR(applyIfOr = {"SuperWordReductions", "false", "LoopMaxUnroll", "< 8"}, > 97: failOn = {IRNode.ADD_REDUCTION_VD, IRNode.ABS_V, IRNode.NEG_V}) @danielogh Are you sure that this would not vectorize if `sve` is available? It may not fail on any Oracle test machines, but we only have `x64` and `aarch64` (only have `asimd`, not `sve`). Have you ever let this be tested explicitly with `sve` support? ------------- PR: https://git.openjdk.org/jdk/pull/12683 From epeter at openjdk.org Wed Mar 1 12:42:10 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 1 Mar 2023 12:42:10 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v3] In-Reply-To: References: Message-ID: On Tue, 28 Feb 2023 08:03:20 GMT, Daniel Skantz wrote: >> We are lifting some loopopts/superword tests to use the IR framework, and add IR annotations to check that vector reductions take place on x86_64. This can be useful to prevent issues such as JDK-8300865. >> >> Approach: lift the more general tests in loopopts/superword, mainly using matching rules in cpu/x86/x86.ad, but leave tests mostly unchanged otherwise. Some reductions are considered non-profitable (superword.cpp), so we might need to raise sse/avx value pre-conditions from what would be a strict reading of x86.ad (as noted by @eme64). >> >> Testing: Local testing (x86_64) using UseSSE={2,3,4}, UseAVX={0,1,2,3}. Tested running all jtreg compiler tests. Tier1-tier5 runs to my knowledge never showed any compiler-related regression in other tests as a result from this work. GHA. Validation: all tests fail if we put unreasonable counts for the respective reduction node, such as counts = {IRNode.ADD_REDUCTION_VI, ">= 10000000"}). >> >> Thanks @robcasloz and @eme64 for advice. >> >> Notes: ProdRed_Double does not vectorize (JDK-8300865). SumRed_Long does not vectorize on 32-bit, according to my reading of source, test on GHA and cross-compiled JDK on 32-bit Linux, so removed these platforms from @requires. Lifted the AbsNeg tests too but added no checks, as these are currently not run on x86_64. > > Daniel Skantz has updated the pull request incrementally with one additional commit since the last revision: > > Address further review comments (edits) @danielogh Good work! I think we are getting there now. If this passes on `sve` I'd approve this. Well, it would be nice if you could also remove some of the unnecessary stores in the reduction tests, that would be a nice improvement. @danielogh You should merge from master, because it seems your state still is broken for ARM (see github actions). test/hotspot/jtreg/compiler/loopopts/superword/RedTest_int.java line 149: > 147: @IR(applyIfCPUFeature = {"sse4.1", "true"}, > 148: applyIfAnd = {"SuperWordReductions", "true", "LoopMaxUnroll", ">= 8"}, > 149: counts = {IRNode.ADD_REDUCTION_VI, ">= 1"}) We could also remove the store to `d` here. I think it should still vectorize. I think that applies to all examples in this test. And also for `RedTest_long.java`. 
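For illustration, the suggestion amounts to reshaping the loop body along these lines (a rough sketch, not the actual RedTest_int source):

```java
// Rough sketch of the suggestion (not the actual RedTest_int source): accumulate
// directly instead of going through an intermediate store to a result array,
// which should still be recognized as an AddReductionVI.
public static int sumReductionWithoutStore(int[] a, int[] b, int[] c) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
        total += a[i] + b[i] + c[i]; // reduction only, no d[i] = ... store
    }
    return total;
}
```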
test/hotspot/jtreg/compiler/loopopts/superword/SumRedSqrt_Double.java line 100: > 98: @IR(applyIfCPUFeature = {"avx", "true"}, > 99: applyIfAnd = {"SuperWordReductions", "true", "LoopMaxUnroll", ">= 8"}, > 100: counts = {IRNode.ADD_REDUCTION_VD, ">= 1", IRNode.SQRT_V, ">= 1"}) We could also remove the store to `d` here. ------------- PR: https://git.openjdk.org/jdk/pull/12683 From epeter at openjdk.org Wed Mar 1 12:42:13 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 1 Mar 2023 12:42:13 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v3] In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 12:17:57 GMT, Emanuel Peter wrote: >> Daniel Skantz has updated the pull request incrementally with one additional commit since the last revision: >> >> Address further review comments (edits) > > test/hotspot/jtreg/compiler/loopopts/superword/SumRedAbsNeg_Double.java line 97: > >> 95: @Test >> 96: @IR(applyIfOr = {"SuperWordReductions", "false", "LoopMaxUnroll", "< 8"}, >> 97: failOn = {IRNode.ADD_REDUCTION_VD, IRNode.ABS_V, IRNode.NEG_V}) > > @danielogh Are you sure that this would not vectorize if `sve` is available? It may not fail on any Oracle test machines, but we only have `x64` and `aarch64` (only have `asimd`, not `sve`). Have you ever let this be tested explicitly with `sve` support? @fg1417 could you please run this patch on a machine with `sve` support? It would be good to know that all IR rules are ok. > test/hotspot/jtreg/compiler/loopopts/superword/SumRedSqrt_Double.java line 100: > >> 98: @IR(applyIfCPUFeature = {"avx", "true"}, >> 99: applyIfAnd = {"SuperWordReductions", "true", "LoopMaxUnroll", ">= 8"}, >> 100: counts = {IRNode.ADD_REDUCTION_VD, ">= 1", IRNode.SQRT_V, ">= 1"}) > > We could also remove the store to `d` here. And all the following files as well: SumRed_Double.java SumRed_Float.java SumRed_Int.java SumRed_Long.java ------------- PR: https://git.openjdk.org/jdk/pull/12683 From tholenstein at openjdk.org Wed Mar 1 12:46:54 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 1 Mar 2023 12:46:54 GMT Subject: RFR: JDK-8303443: IGV: Syntax highlighting and resizing for filter editor Message-ID: In the Filter window of the IdealGraphVisualizer (IGV) the user can double-click on a filter to edit the javascript code. 
- Previously, the code window was not resizable and had no syntax highlighting editor_old - Now, the code window can be resized by the user and has basic syntax highlighting for `keywords`, `strings` and `comments` editor_new - Further all filter are now saved as .js files in `src/utils/IdealGraphVisualizer/application/target/userdir/config/Filters` and reloaded when opening a new IGV instance js_ext ------------- Commit messages: - IGV: Syntax highlighting and resizing for filter editor Changes: https://git.openjdk.org/jdk/pull/12803/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12803&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303443 Stats: 114 lines in 2 files changed: 84 ins; 12 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/12803.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12803/head:pull/12803 PR: https://git.openjdk.org/jdk/pull/12803 From jsjolen at openjdk.org Wed Mar 1 14:15:18 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Wed, 1 Mar 2023 14:15:18 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v3] In-Reply-To: <6UYV4UEowi3T3oMiBn4srMGOZIaf_iwcj3_hlDx--FI=.1ead71f0-ce00-4826-ac11-64e4e7500192@github.com> References: <6UYV4UEowi3T3oMiBn4srMGOZIaf_iwcj3_hlDx--FI=.1ead71f0-ce00-4826-ac11-64e4e7500192@github.com> Message-ID: On Wed, 1 Mar 2023 09:06:50 GMT, Johan Sj?len wrote: >> Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we >> need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. >> >> Here are some typical things to look out for: >> >> No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). >> Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. >> nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. >> >> An example of this: >> >> >> // This function returns null >> void* ret_null(); >> // This function returns true if *x == nullptr >> bool is_nullptr(void** x); >> >> >> Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. >> >> Thanks! > > Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: > > Check for null string explicitly Passes tier1, tier2. ------------- PR: https://git.openjdk.org/jdk/pull/12187 From tobias.hartmann at oracle.com Wed Mar 1 14:17:18 2023 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 1 Mar 2023 15:17:18 +0100 Subject: [External] : Re: A noob question about weird sequence of `*synchronization entry` in a C2 compiled code In-Reply-To: <322745e1-ab69-17f7-571f-ed2d5580204c@oracle.com> References: <648b6872-b63d-c861-f592-bea89b6a48fd@oracle.com> <322745e1-ab69-17f7-571f-ed2d5580204c@oracle.com> Message-ID: <229bb471-9377-ba7e-2e65-8e140e6afdd7@oracle.com> Hi Jaroslav, I finally got the time to investigate this properly and found the root cause. The fix is out for review: https://github.com/openjdk/jdk/pull/12806 Thanks again for reporting this! Best regards, Tobias On 07.02.22 23:20, dean.long at oracle.com wrote: > It turns out the problem with DebugNonSafepoints is a known issue.? 
See > https://bugs.openjdk.java.net/browse/JDK-8201516. > > dl > > On 2/7/22 7:58 AM, Jaroslav Bachor?k wrote: >> Hi Dean, >> >> The first thing I want to mention is that I isolated? this behaviour >> to be triggered by `-XX:DebugNonSafepoints` JVM arg. When this option >> is not specified I don't see the pattern at all. >> >> I have extracted a self-contained (almost) reproducer - it is not a >> single class, unfortunately, but building and running it is as simple >> as executing the attached `./run.sh` >> The reproducer has a weak point, though - I am not able to get the >> `*synchronization entry` pattern manifesting at the same locations. >> Therefore it is more of PoC than a full test case - it requires going >> to the assembly print out and searching for `*synchronization entry` >> manually. >> >> The project can be found here - >> https://urldefense.com/v3/__https://drive.google.com/file/d/1Z6rX4NpvNpctVA3AuYfaG0Qxjd4KjrED/view?usp=sharing__;!!ACWV5N9M2RV99hQ!f9PhKRCRHuSdH__32-XrLtTdu0RoQCQz8wJGjvsqF83SjHg4R2_vAcUJOJFYqd4$ >> It is a zipped gradle java project and the only thing it requires is a >> working Java env. >> >> Thanks! >> >> -JB- >> >> On Sun, Feb 6, 2022 at 6:45 AM wrote: >>> >>> On 2/5/22 12:58 PM, Jaroslav Bachor?k wrote: >>>> Is this the hard limitation of what is possible to do with the debug >>>> data at this level of optimization? Are some boundaries irreversibly >>>> lost here? >>> >>> It's quite likely you've discovered a bug that can be fixed.? If you >>> could narrow it down to a self-contained test case, that would really help. >>> >>> dl From thartmann at openjdk.org Wed Mar 1 14:19:57 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 1 Mar 2023 14:19:57 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information Message-ID: C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. 
With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. The fix is to move `set_default_node_notes` down to after `do_exits`. I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. Thanks, Tobias ------------- Commit messages: - Added missing @ForceInline - Fixed test method names - More fixes - 8201516: DebugNonSafepoints generates incorrect information Changes: https://git.openjdk.org/jdk/pull/12806/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12806&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8201516 Stats: 151 lines in 3 files changed: 149 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12806.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12806/head:pull/12806 PR: https://git.openjdk.org/jdk/pull/12806 From thartmann at openjdk.org Wed Mar 1 14:38:49 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 1 Mar 2023 14:38:49 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v2] In-Reply-To: References: Message-ID: > C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. 
> > The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: > > ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) > > It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. > > With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: > > ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) > > With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). > > The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: > https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 > > Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. > > The fix is to move `set_default_node_notes` down to after `do_exits`. > > I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. 
> > Thanks, > Tobias Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: Removed default argument ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12806/files - new: https://git.openjdk.org/jdk/pull/12806/files/13281557..ec3ee517 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12806&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12806&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12806.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12806/head:pull/12806 PR: https://git.openjdk.org/jdk/pull/12806 From jaroslav.bachorik at datadoghq.com Wed Mar 1 16:05:58 2023 From: jaroslav.bachorik at datadoghq.com (=?UTF-8?Q?Jaroslav_Bachor=C3=ADk?=) Date: Wed, 1 Mar 2023 18:05:58 +0200 Subject: [External] : Re: A noob question about weird sequence of `*synchronization entry` in a C2 compiled code In-Reply-To: <229bb471-9377-ba7e-2e65-8e140e6afdd7@oracle.com> References: <648b6872-b63d-c861-f592-bea89b6a48fd@oracle.com> <322745e1-ab69-17f7-571f-ed2d5580204c@oracle.com> <229bb471-9377-ba7e-2e65-8e140e6afdd7@oracle.com> Message-ID: Hi Tobias, Thanks for fixing this subtle and annoying bug. Looking forward to having the fix accepted so we can start backporting it. Cheers, Jaroslav On Wed 1. 3. 2023 at 16:17, Tobias Hartmann wrote: > Hi Jaroslav, > > I finally got the time to investigate this properly and found the root > cause. The fix is out for > review: https://github.com/openjdk/jdk/pull/12806 > > Thanks again for reporting this! > > Best regards, > Tobias > > > On 07.02.22 23:20, dean.long at oracle.com wrote: > > It turns out the problem with DebugNonSafepoints is a known issue. See > > https://bugs.openjdk.java.net/browse/JDK-8201516. > > > > dl > > > > On 2/7/22 7:58 AM, Jaroslav Bachor?k wrote: > >> Hi Dean, > >> > >> The first thing I want to mention is that I isolated this behaviour > >> to be triggered by `-XX:DebugNonSafepoints` JVM arg. When this option > >> is not specified I don't see the pattern at all. > >> > >> I have extracted a self-contained (almost) reproducer - it is not a > >> single class, unfortunately, but building and running it is as simple > >> as executing the attached `./run.sh` > >> The reproducer has a weak point, though - I am not able to get the > >> `*synchronization entry` pattern manifesting at the same locations. > >> Therefore it is more of PoC than a full test case - it requires going > >> to the assembly print out and searching for `*synchronization entry` > >> manually. > >> > >> The project can be found here - > >> > https://urldefense.com/v3/__https://drive.google.com/file/d/1Z6rX4NpvNpctVA3AuYfaG0Qxjd4KjrED/view?usp=sharing__;!!ACWV5N9M2RV99hQ!f9PhKRCRHuSdH__32-XrLtTdu0RoQCQz8wJGjvsqF83SjHg4R2_vAcUJOJFYqd4$ > >> It is a zipped gradle java project and the only thing it requires is a > >> working Java env. > >> > >> Thanks! > >> > >> -JB- > >> > >> On Sun, Feb 6, 2022 at 6:45 AM wrote: > >>> > >>> On 2/5/22 12:58 PM, Jaroslav Bachor?k wrote: > >>>> Is this the hard limitation of what is possible to do with the debug > >>>> data at this level of optimization? Are some boundaries irreversibly > >>>> lost here? > >>> > >>> It's quite likely you've discovered a bug that can be fixed. If you > >>> could narrow it down to a self-contained test case, that would really > help. > >>> > >>> dl > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kvn at openjdk.org Wed Mar 1 17:38:19 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 1 Mar 2023 17:38:19 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v3] In-Reply-To: <6UYV4UEowi3T3oMiBn4srMGOZIaf_iwcj3_hlDx--FI=.1ead71f0-ce00-4826-ac11-64e4e7500192@github.com> References: <6UYV4UEowi3T3oMiBn4srMGOZIaf_iwcj3_hlDx--FI=.1ead71f0-ce00-4826-ac11-64e4e7500192@github.com> Message-ID: On Wed, 1 Mar 2023 09:06:50 GMT, Johan Sj?len wrote: >> Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we >> need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. >> >> Here are some typical things to look out for: >> >> No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). >> Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. >> nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. >> >> An example of this: >> >> >> // This function returns null >> void* ret_null(); >> // This function returns true if *x == nullptr >> bool is_nullptr(void** x); >> >> >> Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. >> >> Thanks! > > Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: > > Check for null string explicitly Update is good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12187 From kvn at openjdk.org Wed Mar 1 17:42:17 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 1 Mar 2023 17:42:17 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v3] In-Reply-To: <6UYV4UEowi3T3oMiBn4srMGOZIaf_iwcj3_hlDx--FI=.1ead71f0-ce00-4826-ac11-64e4e7500192@github.com> References: <6UYV4UEowi3T3oMiBn4srMGOZIaf_iwcj3_hlDx--FI=.1ead71f0-ce00-4826-ac11-64e4e7500192@github.com> Message-ID: <2Wj1m49Lh7vMO90T6cfohFW0vo8v4lvryw-gEFFtx3Y=.44365f22-9a12-4fed-8f32-0d608fad8be4@github.com> On Wed, 1 Mar 2023 09:06:50 GMT, Johan Sj?len wrote: >> Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we >> need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. >> >> Here are some typical things to look out for: >> >> No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). >> Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. >> nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. >> >> An example of this: >> >> >> // This function returns null >> void* ret_null(); >> // This function returns true if *x == nullptr >> bool is_nullptr(void** x); >> >> >> Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. >> >> Thanks! 
> > Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: > > Check for null string explicitly Cross compilation for ARM in GHA failed with: ------------- PR: https://git.openjdk.org/jdk/pull/12187 From kvn at openjdk.org Wed Mar 1 17:59:17 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 1 Mar 2023 17:59:17 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v2] In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 14:38:49 GMT, Tobias Hartmann wrote: >> C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. >> >> The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: >> >> ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) >> >> It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. >> >> With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: >> >> ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) >> >> With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). >> >> The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: >> https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 >> >> Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. >> >> The fix is to move `set_default_node_notes` down to after `do_exits`. 
>> >> I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Removed default argument I have a few comments. src/hotspot/share/opto/phaseX.cpp line 474: > 472: GrowableArray<Node_Notes*>* old_node_note_array = C->node_note_array(); > 473: if (old_node_note_array != NULL) { > 474: C->set_node_note_array(new (C->comp_arena()) GrowableArray<Node_Notes*> (C->comp_arena(), 8, 0, NULL)); Use `nullptr` in these lines. Can you use `_useful.size()` as initial array length? src/hotspot/share/opto/phaseX.cpp line 492: > 490: _old2new_map.at_put(n->_idx, current_idx); > 491: > 492: if (old_node_note_array != NULL) { nullptr ------------- PR: https://git.openjdk.org/jdk/pull/12806 From luhenry at openjdk.org Wed Mar 1 22:25:16 2023 From: luhenry at openjdk.org (Ludovic Henry) Date: Wed, 1 Mar 2023 22:25:16 GMT Subject: RFR: 8302384: Handle hsdis out-of-bound logic for RISC-V [v3] In-Reply-To: References: Message-ID: On Tue, 21 Feb 2023 05:01:50 GMT, Xiaolin Zheng wrote: >> Several debug assertion failures have been observed on RISC-V, on physical boards only.
>> >> >> So, we should overwrite the `disassemble_info.memory_error_func` in the binutils callback [7], to generate our own output: >> >> 0x0000003f901a41b4: auipc t0,0x0 ; {trampoline_stub} >> 0x0000003f901a41b8: ld t0,12(t0) # 0x0000003f901a41c0 >> 0x0000003f901a41bc: jr t0 >> 0x0000003f901a41c0: .2byte 0x8ec0 >> 0x0000003f901a41c2: srli s0,s0,0x21 >> 0x0000003f901a41c4: .4byte 0x0000003f >> >> >> Mirroring the code of hsdis-llvm, to print merely a 4-byte data [8]. >> >> >> BTW, the reason why the crash only happens on the physical board, is that boards support RISC-V sv39 address mode only: a legal user-space address can be no more than 38-bit. So the code cache is always mmapped to an address like `0x3fe0000000`. Such a `0x3f` is always recognized as the mark of an 8-byte instruction [9]. >> >> >> Tested hotspot tier1~4 with fastdebug build, no new errors found. >> >> Thanks, >> Xiaolin >> >> >> [1] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/riscv-dis.c#L940 >> [2] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/aarch64-dis.c#L3792 >> [3] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/ppc-dis.c#L872 >> [4] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/s390-dis.c#L305 >> [5] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/i386-dis.c#L9466 (the i386 one uses a `setlongjmp` to handle the exception case, so the code might look different) >> [6] https://github.com/openjdk/jdk/blob/94e7cc8587356988e713d23d1653bdd5c43fb3f1/src/utils/hsdis/binutils/hsdis-binutils.c#L198 >> [7] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/dis-buf.c#L51-L72 >> [8] https://github.com/openjdk/jdk/blob/94e7cc8587356988e713d23d1653bdd5c43fb3f1/src/utils/hsdis/llvm/hsdis-llvm.cpp#L316-L317 >> [9] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/include/opcode/riscv.h#L30-L42 > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Remove white spaces, and use hsdis code style Marked as reviewed by luhenry (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/12551 From duke at openjdk.org Wed Mar 1 23:21:18 2023 From: duke at openjdk.org (Saint Wesonga) Date: Wed, 1 Mar 2023 23:21:18 GMT Subject: RFR: 8303409: Add Windows AArch64 ABI support to the Foreign Function & Memory API Message-ID: There are 2 primary differences between the Windows ARM64 ABI and the macOS/Linux ARM64 ABI: variadic floating point arguments are passed in general purpose registers on Windows (instead of the vector registers). In addition to this, up to 64 bytes of a struct being passed to a variadic function can be placed in general purpose registers. This happens regardless of the type of struct (HFA or other generic struct). This means that a struct can be split across registers and the stack when invoking a variadic function. The Windows ARM64 ABI conventions are documented at https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions For details about the Foreign Function & Memory API, see JEP 434 at https://openjdk.org/jeps/434 This change is a cherry pick of https://github.com/openjdk/panama-foreign/commit/d379ca1c and https://github.com/openjdk/panama-foreign/commit/08225e4f from https://github.com/openjdk/panama-foreign/pull/754 and includes an additional commit that introduces a VaList implementation for Windows on AArch64. 
------------- Commit messages: - Add Windows AArch64 VaList implementation - 8295290: Add Windows ARM64 ABI support to the Foreign Function & Memory API - Move Linux & MacOs CallArranger tests into separate files Changes: https://git.openjdk.org/jdk/pull/12773/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12773&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303409 Stats: 2135 lines in 20 files changed: 1445 ins; 650 del; 40 mod Patch: https://git.openjdk.org/jdk/pull/12773.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12773/head:pull/12773 PR: https://git.openjdk.org/jdk/pull/12773 From jvernee at openjdk.org Wed Mar 1 23:21:18 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Wed, 1 Mar 2023 23:21:18 GMT Subject: RFR: 8303409: Add Windows AArch64 ABI support to the Foreign Function & Memory API In-Reply-To: References: Message-ID: On Mon, 27 Feb 2023 17:04:28 GMT, Saint Wesonga wrote: > There are 2 primary differences between the Windows ARM64 ABI and the macOS/Linux ARM64 ABI: variadic floating point arguments are passed in general purpose registers on Windows (instead of the vector registers). In addition to this, up to 64 bytes of a struct being passed to a variadic function can be placed in general purpose registers. This happens regardless of the type of struct (HFA or other generic struct). This means that a struct can be split across registers and the stack when invoking a variadic function. The Windows ARM64 ABI conventions are documented at https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions > > For details about the Foreign Function & Memory API, see JEP 434 at https://openjdk.org/jeps/434 > > This change is a cherry pick of https://github.com/openjdk/panama-foreign/commit/d379ca1c and https://github.com/openjdk/panama-foreign/commit/08225e4f from https://github.com/openjdk/panama-foreign/pull/754 and includes an additional commit that introduces a VaList implementation for Windows on AArch64. All still looks good (including the VaList impl, though that is less important now since we plan to remove VaList in 21) I'm running tier 1-4 before giving a checkmark. Tests came back green @swesonga I've filed a new issue here: https://bugs.openjdk.org/browse/JDK-8303409 Please use that issue number in the PR title (the name should already be correct). ------------- PR: https://git.openjdk.org/jdk/pull/12773Marked as reviewed by jvernee (Reviewer). From fgao at openjdk.org Thu Mar 2 01:47:14 2023 From: fgao at openjdk.org (Fei Gao) Date: Thu, 2 Mar 2023 01:47:14 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v3] In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 12:33:20 GMT, Emanuel Peter wrote: > @fg1417 could you please run this patch on a machine with `sve` support? It would be good to know that all IR rules are ok. Sure. I'll do the testing on our sve machine. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/12683 From xlinzheng at openjdk.org Thu Mar 2 03:40:13 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Thu, 2 Mar 2023 03:40:13 GMT Subject: RFR: 8302384: Handle hsdis out-of-bound logic for RISC-V [v3] In-Reply-To: References: Message-ID: On Tue, 21 Feb 2023 05:01:50 GMT, Xiaolin Zheng wrote: >> Several debug assertion failures have been observed on RISC-V, on physical boards only. 
>> >> Failure list: (the `hs_err` log is in the JBS issue) >> >> compiler/vectorapi/TestVectorShiftImm.java >> compiler/compilercontrol/jcmd/AddPrintAssemblyTest.java >> compiler/intrinsics/math/TestFpMinMaxIntrinsics.java >> compiler/compilercontrol/TestCompilerDirectivesCompatibilityFlag.java >> compiler/compilercontrol/TestCompilerDirectivesCompatibilityCommandOn.java >> compiler/runtime/TestConstantsInError.java >> compiler/compilercontrol/jcmd/PrintDirectivesTest.java >> >> >> When the failure occurs, hsdis is disassembling the last unrecognizable data at the end of a code blob, usually the data stored in trampolines. It could be theoretically any address inside the code cache, and sometimes binutils can recognize the data as 2-byte instructions, 4-byte instructions, and 6 or 8-byte instructions even though as far as I know no instructions longer than 4-byte have landed. Therefore, binutils may firstly run out of bound after the calculation. However, the RISC-V binutils returns our `hsdis_read_memory_func`'s return number directly [1] (an EIO, which is `5`, FYI), rather than returning a `-1` (FYI, [2][3][4][5]) on other platforms when such out-of-bound happens. So when coming back to our hsdis, we (hsdis) get the `size = 5` as the return value [6] rather than `-1`: our hsdis error handling is skipped, our variable `p` is out of bound, and then we meet the crash. >> >> To fix it, we should check the value is the special `EIO` on RISC-V. However, after fixing that issue, I found binutils would print some messages like "Address 0x%s is out of bounds." on the screen: >> >> >> 0x0000003f901a41b4: auipc t0,0x0 ; {trampoline_stub} >> 0x0000003f901a41b8: ld t0,12(t0) # 0x0000003f901a41c0 >> 0x0000003f901a41bc: jr t0 >> 0x0000003f901a41c0: .2byte 0x8ec0 >> 0x0000003f901a41c2: srli s0,s0,0x21 >> 0x0000003f901a41c4: Address 0x0000003f901a41c9 is out of bounds. <----------- But we want the real bytes here. >> >> >> So, we should overwrite the `disassemble_info.memory_error_func` in the binutils callback [7], to generate our own output: >> >> 0x0000003f901a41b4: auipc t0,0x0 ; {trampoline_stub} >> 0x0000003f901a41b8: ld t0,12(t0) # 0x0000003f901a41c0 >> 0x0000003f901a41bc: jr t0 >> 0x0000003f901a41c0: .2byte 0x8ec0 >> 0x0000003f901a41c2: srli s0,s0,0x21 >> 0x0000003f901a41c4: .4byte 0x0000003f >> >> >> Mirroring the code of hsdis-llvm, to print merely a 4-byte data [8]. >> >> >> BTW, the reason why the crash only happens on the physical board, is that boards support RISC-V sv39 address mode only: a legal user-space address can be no more than 38-bit. So the code cache is always mmapped to an address like `0x3fe0000000`. Such a `0x3f` is always recognized as the mark of an 8-byte instruction [9]. >> >> >> Tested hotspot tier1~4 with fastdebug build, no new errors found. 
>> >> Thanks, >> Xiaolin >> >> >> [1] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/riscv-dis.c#L940 >> [2] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/aarch64-dis.c#L3792 >> [3] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/ppc-dis.c#L872 >> [4] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/s390-dis.c#L305 >> [5] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/i386-dis.c#L9466 (the i386 one uses a `setlongjmp` to handle the exception case, so the code might look different) >> [6] https://github.com/openjdk/jdk/blob/94e7cc8587356988e713d23d1653bdd5c43fb3f1/src/utils/hsdis/binutils/hsdis-binutils.c#L198 >> [7] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/dis-buf.c#L51-L72 >> [8] https://github.com/openjdk/jdk/blob/94e7cc8587356988e713d23d1653bdd5c43fb3f1/src/utils/hsdis/llvm/hsdis-llvm.cpp#L316-L317 >> [9] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/include/opcode/riscv.h#L30-L42 > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Remove white spaces, and use hsdis code style Uh. Sorry for the disturbance; I forgot the rule was "1 review required, with at least 1 Reviewer" in the mainline. Seems we need to wait for another proper review of this. BTW, an issue [1] for binutils was filed yesterday to track this; though I have not got a confirmation about whether it could get fixed, and when. [1] https://sourceware.org/bugzilla/show_bug.cgi?id=30184 ------------- PR: https://git.openjdk.org/jdk/pull/12551 From jbhateja at openjdk.org Thu Mar 2 05:55:12 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 2 Mar 2023 05:55:12 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v9] In-Reply-To: <_eDa4Xjs3HIxm2u4bUOWmmZL6y2DjQGd9kVyrUCTSD0=.d56d4533-bf46-4900-8c7c-6e22d247ff22@github.com> References: <8lyMxPlTmzLDuwFLvhla8AAL6sVxEjigf4LaDlsAVWg=.4bfb5428-4b11-4c65-8540-503cf9e595f1@github.com> <_eDa4Xjs3HIxm2u4bUOWmmZL6y2DjQGd9kVyrUCTSD0=.d56d4533-bf46-4900-8c7c-6e22d247ff22@github.com> Message-ID: On Fri, 24 Feb 2023 15:29:14 GMT, Jatin Bhateja wrote: >> @jatin-bhateja Ok, I have reconsidered it. I will add some `SuperWordMaxVectorSize` and `AlignVector` combinations. But I will do it in a separate file, and always have CompileCommand directive `Vectorize` enabled (`_do_vector_loop == true`). I might refactor `TestOptionVectorizeIR.java` for that. >> Let me know if you find it essencial to have the tests also with `_do_vector_loop == false`. > > Sounds good. > @jatin-bhateja I now have a first version out of the `test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java`. It seems to work for `sse4.1 ... avx512`. I'm now testing it for `asimd`. And then I will proceed to add the features for `-XX:+AlignVector`, with the modulo check. I am seeing lots of IR violations with UseSSE=4. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From thartmann at openjdk.org Thu Mar 2 06:16:53 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 2 Mar 2023 06:16:53 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v3] In-Reply-To: References: Message-ID: > C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. 
The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. > > The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: > > ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) > > It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. > > With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: > > ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) > > With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). > > The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: > https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 > > Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. > > The fix is to move `set_default_node_notes` down to after `do_exits`. > > I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. 
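To make the `do_exits` case concrete, here is a hedged sketch of the shape that `testFinalFieldInit` covers; the class and method names are made up for illustration and are not the actual test code:

    // Hypothetical shape only; see testFinalFieldInit in the new regression test.
    class Holder {
        final long value;
        Holder(long value) {
            this.value = value; // final field store -> MemBarRelease emitted in do_exits()
        }
    }

    static Holder make(long value) {
        return new Holder(value); // before the fix, the debug info attached to that barrier
                                  // pointed at the entry bci of this caller
    }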
> > Thanks, > Tobias Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: Use nullptr ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12806/files - new: https://git.openjdk.org/jdk/pull/12806/files/ec3ee517..413729d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12806&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12806&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/12806.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12806/head:pull/12806 PR: https://git.openjdk.org/jdk/pull/12806 From fgao at openjdk.org Thu Mar 2 06:56:18 2023 From: fgao at openjdk.org (Fei Gao) Date: Thu, 2 Mar 2023 06:56:18 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v3] In-Reply-To: References: Message-ID: On Tue, 28 Feb 2023 08:03:20 GMT, Daniel Skantz wrote: >> We are lifting some loopopts/superword tests to use the IR framework, and add IR annotations to check that vector reductions take place on x86_64. This can be useful to prevent issues such as JDK-8300865. >> >> Approach: lift the more general tests in loopopts/superword, mainly using matching rules in cpu/x86/x86.ad, but leave tests mostly unchanged otherwise. Some reductions are considered non-profitable (superword.cpp), so we might need to raise sse/avx value pre-conditions from what would be a strict reading of x86.ad (as noted by @eme64). >> >> Testing: Local testing (x86_64) using UseSSE={2,3,4}, UseAVX={0,1,2,3}. Tested running all jtreg compiler tests. Tier1-tier5 runs to my knowledge never showed any compiler-related regression in other tests as a result from this work. GHA. Validation: all tests fail if we put unreasonable counts for the respective reduction node, such as counts = {IRNode.ADD_REDUCTION_VI, ">= 10000000"}). >> >> Thanks @robcasloz and @eme64 for advice. >> >> Notes: ProdRed_Double does not vectorize (JDK-8300865). SumRed_Long does not vectorize on 32-bit, according to my reading of source, test on GHA and cross-compiled JDK on 32-bit Linux, so removed these platforms from @requires. Lifted the AbsNeg tests too but added no checks, as these are currently not run on x86_64. > > Daniel Skantz has updated the pull request incrementally with one additional commit since the last revision: > > Address further review comments (edits) Changes requested by fgao (Committer). test/hotspot/jtreg/compiler/loopopts/superword/RedTest_long.java line 150: > 148: @IR(applyIfOr = {"SuperWordReductions", "false", "LoopMaxUnroll", "< 8"}, > 149: failOn = {IRNode.ADD_REDUCTION_VL}) > 150: @IR(applyIfCPUFeature = {"avx2", "false"}, I'm afraid the restriction here is not safe because all non-avx2 machines would meet the condition, like aarch64 and riscv64. SVE supports `AndReductionV`, `OrReductionV `, `XorReductionV`, `AddReductionVL`, `AddReductionVD`, `SqrtVD`, and so on. 
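One way to keep such a negative check from applying to non-x86 targets is to require an x86-only feature in the same rule; the following is a hedged sketch of the idea (using `applyIfCPUFeatureAnd` from the IR framework), not the exact change that ends up in the PR:

    import compiler.lib.ir_framework.*;

    // Inside an IR framework test class: require sse4.1 (x86-only) together with
    // the absence of avx2, so the "reduction must not vectorize" claim is only
    // checked on x86 below AVX2 and never on aarch64/riscv64, where SVE may
    // legally vectorize it.
    @Test
    @IR(applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "false"},
        failOn = {IRNode.ADD_REDUCTION_VL})
    public static long sumLong(long[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }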
test/hotspot/jtreg/compiler/loopopts/superword/RedTest_long.java line 172: > 170: @IR(applyIfOr = {"SuperWordReductions", "false", "LoopMaxUnroll", "< 8"}, > 171: failOn = {IRNode.OR_REDUCTION_V}) > 172: @IR(applyIfCPUFeature = {"avx2", "false"}, ditto test/hotspot/jtreg/compiler/loopopts/superword/RedTest_long.java line 194: > 192: @IR(applyIfOr = {"SuperWordReductions", "false", "LoopMaxUnroll", "< 8"}, > 193: failOn = {IRNode.AND_REDUCTION_V}) > 194: @IR(applyIfCPUFeature = {"avx2", "false"}, ditto test/hotspot/jtreg/compiler/loopopts/superword/RedTest_long.java line 216: > 214: @IR(applyIfOr = {"SuperWordReductions", "false", "LoopMaxUnroll", "< 8"}, > 215: failOn = {IRNode.XOR_REDUCTION_V}) > 216: @IR(applyIfCPUFeature = {"avx2", "false"}, ditto test/hotspot/jtreg/compiler/loopopts/superword/SumRedSqrt_Double.java line 95: > 93: @IR(applyIfOr = {"SuperWordReductions", "false", "LoopMaxUnroll", "< 8"}, > 94: failOn = {IRNode.ADD_REDUCTION_VD, IRNode.SQRT_V}) > 95: @IR(applyIfCPUFeature = {"avx", "false"}, ditto test/hotspot/jtreg/compiler/loopopts/superword/SumRed_Long.java line 98: > 96: @IR(applyIfOr = {"SuperWordReductions", "false", "LoopMaxUnroll", "< 8"}, > 97: failOn = {IRNode.ADD_REDUCTION_VL}) > 98: @IR(applyIfCPUFeature = {"avx2", "false"}, ditto ------------- PR: https://git.openjdk.org/jdk/pull/12683 From epeter at openjdk.org Thu Mar 2 07:13:16 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 07:13:16 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: On Thu, 2 Mar 2023 05:46:11 GMT, Jatin Bhateja wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> TestDependencyOffsets.java: MulVL not supported on NEON / asimd. Replaced it with AddVL > > test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java line 881: > >> 879: // cpu: sse4.1 to avx -> vector_width: 16 -> elements in vector: 4 >> 880: // positive byte_offset 4 can lead to cyclic dependency >> 881: @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, > > Needs to fix the test generation script. This IR rule looks incorrect since there is a valid dependency with distance 1. @jatin-bhateja I think the IR rule is just ineffective. I have the following condition in it that will never be met: `applyIfAnd = {"MaxVectorSize", ">= 8", "MaxVectorSize", "<= 4"},` The `<= 4` must hold so that `byte_offset <= MaxVectorSize`, and so the cyclical dependency would not happen. But `>= 8` must hold so that two ints fit in a vector, so that we even vectorize. I could improve the script and filter out such ineffective IR rules. Not sure if that is worth it though. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Thu Mar 2 07:45:01 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 07:45:01 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v16] In-Reply-To: References: Message-ID: > Cyclic dependencies are not handled correctly in all cases. 
Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. 
If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since at that point we are not sure that the packs are actually independent; we only know that adjacent memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machines, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: removed negative rules for TestCyclicDependency.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/0cb67e5d..366bc31b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=15 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=14-15 Stats: 24 lines in 1 file changed: 0 ins; 18 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Thu Mar 2 07:48:19 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 07:48:19 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v9] In-Reply-To: References: <8lyMxPlTmzLDuwFLvhla8AAL6sVxEjigf4LaDlsAVWg=.4bfb5428-4b11-4c65-8540-503cf9e595f1@github.com> <_eDa4Xjs3HIxm2u4bUOWmmZL6y2DjQGd9kVyrUCTSD0=.d56d4533-bf46-4900-8c7c-6e22d247ff22@github.com> Message-ID: On Thu, 2 Mar 2023 05:47:26 GMT, Jatin Bhateja wrote: >> Sounds good. > >> @jatin-bhateja I now have a first version out of the `test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java`. It seems to work for `sse4.1 ... avx512`. I'm now testing it for `asimd`. And then I will proceed to add the features for `-XX:+AlignVector`, with the modulo check. > > I am seeing lots of IR violations with UseSSE=4. I realized I have lots of negative IR rules that check that we do NOT vectorize when I expect a cyclic dependency. But these negative rules are difficult: there may always be some other factor that leads to shorter vector sizes than what I expect. And then it vectorizes, and does not encounter a cyclic dependency. So I will have to remove all these negative IR rules. @jatin-bhateja was there any positive IR rule that failed? One that did expect vectorization, but it did not in fact vectorize?
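To illustrate why such negative rules are fragile, here is a hedged sketch (made-up method, not one of the generated tests): whether the cyclic dependency actually bites depends on the vector length the platform ends up using.

    // Store-to-load dependency at a distance of 2 ints (byte offset 8).
    // With 4-element int vectors the packs would overlap, so "no vectorization"
    // looks like a safe expectation, and a rule like failOn = {IRNode.STORE_VECTOR}
    // seems fine. But a target that only forms 2-element packs (8-byte vectors)
    // can vectorize this legally, and the negative rule then fails even though
    // nothing is wrong.
    static void fragileExpectation(int[] a) {
        for (int i = 0; i < a.length - 2; i++) {
            a[i + 2] = a[i];
        }
    }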
------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Thu Mar 2 07:55:04 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 07:55:04 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v17] In-Reply-To: References: Message-ID: > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. 
Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: remove negative IR rules for TestOptionVectorizeIR.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/366bc31b..9b8738ae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=15-16 Stats: 17 lines in 1 file changed: 0 ins; 17 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Thu Mar 2 08:33:13 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 08:33:13 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v2] In-Reply-To: <5i999_cMhXUrM7TY5N0YGJAsiegwj-SdzuKLGscKaTQ=.4c75e5f3-d9ff-485d-a525-462c02979671@github.com> References: <5i999_cMhXUrM7TY5N0YGJAsiegwj-SdzuKLGscKaTQ=.4c75e5f3-d9ff-485d-a525-462c02979671@github.com> Message-ID: <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> On Tue, 21 Feb 2023 16:05:27 GMT, Roland Westrelin wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: >> >> - comments >> - extra test >> - more >> - Merge branch 'master' into JDK-8300258 >> - review >> - more >> - fix & test > > Maybe we should first collect all legal packs and then try to find the best alignment, dropping packs that are unaligned on those platforms that support it. Hi @rwestrel @vnkozlov . I have another general question. Should this really vectorize? 
@Test @IR(counts = { IRNode.LOAD_VECTOR, ">=1", IRNode.STORE_VECTOR, ">=1" }) public static void testOffHeapLong1(long dest, long[] src) { for (int i = 0; i < src.length; i++) { UNSAFE.putLongUnaligned(null, dest + 8 * i, src[i]); } } I talked with @TobiHartmann about this. It is apparently possible to get the address of an array, and store in a `long`. In that case, we could play with that rawptr a bit, and create a cyclic dependency. Pseudocode: long[] arr = new long[1000]; long ptr = unsafe.getTheArrayAddress(arr); // does not exist directly, but it is possible ptr += 8; // shift it one long forward testOffHeapLong1(ptr, arr); This should have different behavior if it is vectorized or not. But maybe this is expected, and just a bad use of `Unsafe`. Probably we want this to vectorize, for the extra performance, and people who use `Unsafe` just have to be careful not to do such strange things. ------------- PR: https://git.openjdk.org/jdk/pull/12440 From roland at openjdk.org Thu Mar 2 08:37:49 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 2 Mar 2023 08:37:49 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v3] In-Reply-To: References: Message-ID: > The loop that doesn't vectorize is: > > > public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { > for (int i = start; i < stop; i++) { > UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); > } > } > > > It's from a micro-benchmark in the panama > repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing > because it finds it cannot properly align the loop and, from the > comment in the code, that: > > > // Can't allow vectorization of unaligned memory accesses with the > // same type since it could be overlapped accesses to the same array. > > > The test for "same type" is implemented by looking at the memory > operation type which in this case is overly conservative as the loop > above is reading and writing with long loads/stores but from and to > arrays of different types that can't overlap. Actually, with such > mismatched accesses, it's also likely an incorrect test (reading and > writing could be to the same array with loads/stores that use > different operand size) eventhough I couldn't write a test case that > would trigger an incorrect execution. > > As a fix, I propose implementing the "same type" test by looking at > memory aliases instead. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains ten additional commits since the last revision: - more - Merge branch 'master' into JDK-8300258 - comments - extra test - more - Merge branch 'master' into JDK-8300258 - review - more - fix & test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12440/files - new: https://git.openjdk.org/jdk/pull/12440/files/67519781..c6c09763 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12440&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12440&range=01-02 Stats: 13173 lines in 575 files changed: 8752 ins; 2409 del; 2012 mod Patch: https://git.openjdk.org/jdk/pull/12440.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12440/head:pull/12440 PR: https://git.openjdk.org/jdk/pull/12440 From roland at openjdk.org Thu Mar 2 08:46:15 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 2 Mar 2023 08:46:15 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v2] In-Reply-To: <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> References: <5i999_cMhXUrM7TY5N0YGJAsiegwj-SdzuKLGscKaTQ=.4c75e5f3-d9ff-485d-a525-462c02979671@github.com> <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> Message-ID: <2c3FEVc2G6QCil9SlGF17c7ZM5iDJCp9JnC9NoPDAh4=.fbaabbb1-8e47-4467-8e96-2f787a95374d@github.com> On Thu, 2 Mar 2023 08:29:47 GMT, Emanuel Peter wrote: >> Maybe we should first collect all legal packs and then try to find the best alignment, dropping packs that are unaligned on those platforms that support it. > > Hi @rwestrel @vnkozlov . > > I have another general question. Should this really vectorize? > > @Test > @IR(counts = { IRNode.LOAD_VECTOR, ">=1", IRNode.STORE_VECTOR, ">=1" }) > public static void testOffHeapLong1(long dest, long[] src) { > for (int i = 0; i < src.length; i++) { > UNSAFE.putLongUnaligned(null, dest + 8 * i, src[i]); > } > } > > > I talked with @TobiHartmann about this. It is apparently possible to get the address of an array, and store in a `long`. In that case, we could play with that rawptr a bit, and create a cyclic dependency. > > Pseudocode: > > long[] arr = new long[1000]; > long ptr = unsafe.getTheArrayAddress(arr); // does not exist directly, but it is possible > ptr += 8; // shift it one long forward > testOffHeapLong1(ptr, arr); > > > This should have different behavior if it is vectorized or not. > > But maybe this is expected, and just a bad use of `Unsafe`. Probably we want this to vectorize, for the extra performance, and people who use `Unsafe` just have to be careful not to do such strange things. Thanks for the comments @eme64 @vnkozlov Looking at the code again, if `vectors_should_be_aligned()` is true, if `create_pack` is false, the current code removes every memops and already created packset with `same_velt_type()` true: That can't be motivated by a correctness issue. So I suppose we want to preserve that behavior. Wouldn't we need the change of the last commit I pushed then? 
------------- PR: https://git.openjdk.org/jdk/pull/12440 From roland at openjdk.org Thu Mar 2 08:49:16 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 2 Mar 2023 08:49:16 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v2] In-Reply-To: <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> References: <5i999_cMhXUrM7TY5N0YGJAsiegwj-SdzuKLGscKaTQ=.4c75e5f3-d9ff-485d-a525-462c02979671@github.com> <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> Message-ID: On Thu, 2 Mar 2023 08:29:47 GMT, Emanuel Peter wrote: > But maybe this is expected, and just a bad use of `Unsafe`. Probably we want this to vectorize, for the extra performance, and people who use `Unsafe` just have to be careful not to do such strange things. Yes, I would say that qualifies as bad use of `Unsafe` and we can't support everything that's possible with it. ------------- PR: https://git.openjdk.org/jdk/pull/12440 From epeter at openjdk.org Thu Mar 2 09:08:17 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 09:08:17 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v2] In-Reply-To: <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> References: <5i999_cMhXUrM7TY5N0YGJAsiegwj-SdzuKLGscKaTQ=.4c75e5f3-d9ff-485d-a525-462c02979671@github.com> <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> Message-ID: On Thu, 2 Mar 2023 08:29:47 GMT, Emanuel Peter wrote: >> Maybe we should first collect all legal packs and then try to find the best alignment, dropping packs that are unaligned on those platforms that support it. > > Hi @rwestrel @vnkozlov . > > I have another general question. Should this really vectorize? > > @Test > @IR(counts = { IRNode.LOAD_VECTOR, ">=1", IRNode.STORE_VECTOR, ">=1" }) > public static void testOffHeapLong1(long dest, long[] src) { > for (int i = 0; i < src.length; i++) { > UNSAFE.putLongUnaligned(null, dest + 8 * i, src[i]); > } > } > > > I talked with @TobiHartmann about this. It is apparently possible to get the address of an array, and store in a `long`. In that case, we could play with that rawptr a bit, and create a cyclic dependency. > > Pseudocode: > > long[] arr = new long[1000]; > long ptr = unsafe.getTheArrayAddress(arr); // does not exist directly, but it is possible > ptr += 8; // shift it one long forward > testOffHeapLong1(ptr, arr); > > > This should have different behavior if it is vectorized or not. > > But maybe this is expected, and just a bad use of `Unsafe`. Probably we want this to vectorize, for the extra performance, and people who use `Unsafe` just have to be careful not to do such strange things. > Thanks for the comments @eme64 @vnkozlov Looking at the code again, if `vectors_should_be_aligned()` is true, if `create_pack` is false, the current code removes every memops and already created packset with `same_velt_type()` true: That can't be motivated by a correctness issue. So I suppose we want to preserve that behavior. Wouldn't we need the change of the last commit I pushed then? I think the reason we used `same_velt_type` was that we were confused. Or maybe we did that before we had memory slices, and using `same_velt_type` was at least already an improvemnt? At any rate: it was confused and leads to Bugs in conjunction with `Unsafe`, as my example showed. 
Keeping `same_velt_type` will probably not harm much, but be more restrictive than necessary. It will not harm much because `velt_type == memory_slice` as long as we are not using `Unsafe`. And when we do use `Unsafe`, we probably do not use it in very wild ways. One "wild" use might be something like this: void test(int[] iarr, float[] farr) { // cyclic dependency -> not vectorized int v1 = (int)Unsafe.LoadF(iarr, i); // assume this to be best Unsafe.StoreI(iarr, i + 1, v1); // separate slice -> could be vectorized Unsafe.StoreI(farr, i, Unsafe.LoadI(farr, i)); // on different slice as best, but have same velt_type -> rejected // We end up vectorizing nothing, even though we could vectorize the farr } ------------- PR: https://git.openjdk.org/jdk/pull/12440 From roland at openjdk.org Thu Mar 2 09:31:06 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 2 Mar 2023 09:31:06 GMT Subject: RFR: 8303511: C2: assert(get_ctrl(n) == cle_out) during unrolling Message-ID: In the same round of loop optimizations: - `PhaseIdealLoop::remix_address_expressions()` creates a new `AddP` out of loop. It sets its control to `n_loop->_head->in(LoopNode::EntryControl)` which, because the loop is strip mined, is an `OuterStripMinedLoop`. - The `LoadI` for that `AddP` is found to only have uses outside the loop and is cloned out of the loop. It's referenced by the outer loop's safepoint. - The loop is unrolled. Unrolling follows the safepoint's inputs and finds the new `AddP` with control set to the `OuterStripMinedLoop`, and the assert fires. No control should be set to an `OuterStripMinedLoop`. The fix is straightforward and sets the control to the `OuterStripMinedLoop` entry control. ------------- Commit messages: - test - fix Changes: https://git.openjdk.org/jdk/pull/12824/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12824&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303511 Stats: 84 lines in 2 files changed: 82 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12824.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12824/head:pull/12824 PR: https://git.openjdk.org/jdk/pull/12824 From thartmann at openjdk.org Thu Mar 2 09:38:56 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 2 Mar 2023 09:38:56 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v4] In-Reply-To: References: Message-ID: > C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information.
> > The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: > > ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) > > It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. > > With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: > > ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) > > With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). > > The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: > https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 > > Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. > > The fix is to move `set_default_node_notes` down to after `do_exits`. > > I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. 
> > Thanks, > Tobias Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: Presize new node note array ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12806/files - new: https://git.openjdk.org/jdk/pull/12806/files/413729d2..f1bc8db4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12806&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12806&range=02-03 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12806.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12806/head:pull/12806 PR: https://git.openjdk.org/jdk/pull/12806 From thartmann at openjdk.org Thu Mar 2 09:49:12 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 2 Mar 2023 09:49:12 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v2] In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 17:51:14 GMT, Vladimir Kozlov wrote: >> Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: >> >> Removed default argument > > src/hotspot/share/opto/phaseX.cpp line 474: > >> 472: GrowableArray* old_node_note_array = C->node_note_array(); >> 473: if (old_node_note_array != NULL) { >> 474: C->set_node_note_array(new (C->comp_arena()) GrowableArray (C->comp_arena(), 8, 0, NULL)); > > Use `nullptr` in these lines. > Can you use `_useful.size()` as initial array length? Thanks for the review, Vladimir. I updated the `NULL` usages. > Can you use _useful.size() as initial array length? The `node_note_array` uses buckets/blocks of size `C->_node_notes_block_size == 256`. So the actual required size would be `ceil((double)useful.size() / (double)256)` but we could simply use `1 + (useful.size() / 256)`. Now even when initializing the GrowableArray to that size, setting notes will then still call `Compile::grow_node_notes` multiple times to create the `Node_Notes` buckets and since it always at least doubles the backing array size, we actually end up with an array that is larger than what is required: https://github.com/openjdk/jdk/blob/4619e8bae838abd1f243c2c65a538806d226b8e8/src/hotspot/share/opto/compile.cpp#L1233-L1238 I updated the code to properly pre-size the structure by calling `grow_node_notes`. This also has the advantage that the `Node_Notes` arena is one big chunk instead of incrementally allocating small ones. What do you think? ------------- PR: https://git.openjdk.org/jdk/pull/12806 From roland at openjdk.org Thu Mar 2 09:56:08 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 2 Mar 2023 09:56:08 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v2] In-Reply-To: References: <5i999_cMhXUrM7TY5N0YGJAsiegwj-SdzuKLGscKaTQ=.4c75e5f3-d9ff-485d-a525-462c02979671@github.com> <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> Message-ID: On Thu, 2 Mar 2023 09:05:28 GMT, Emanuel Peter wrote: > Keeping `same_velt_type` will probably not harm much, but be more restrictive than neccessary. It's quite possible that it's over conservative. What this change is trying to achieve is to relax checks so a pattern that's known to be used in the core libraries optimizes better. That pattern only optimizes for misaligned accesses. So it does seem wrong that those architectures that don't allow misaligned accesses are affected. 
Also, this code is complicated, so it certainly feels safer to me to be on the safe side even if it feels too restrictive. Going forward, refactoring this would be nice. I gave it a quick try and it was more complicated than I expected. I also don't think we should spend too much time making sure every possible combination of unsafe accesses optimizes well or even correctly if it's too much work. Once people start using unsafe, they are on their own.
I think we should stick with whatever feels reasonable or is used in the core libraries (hopefully the second category is included in the first category). > > What do you think @vnkozlov ? @rwestrel that sounds ok to me. BTW I am refactoring the code around your changes, because multiple bugs https://github.com/openjdk/jdk/pull/12350. I think the code was so badly structured that more and more bugs snuck in. ------------- PR: https://git.openjdk.org/jdk/pull/12440 From duke at openjdk.org Thu Mar 2 13:29:33 2023 From: duke at openjdk.org (Saint Wesonga) Date: Thu, 2 Mar 2023 13:29:33 GMT Subject: Integrated: 8303409: Add Windows AArch64 ABI support to the Foreign Function & Memory API In-Reply-To: References: Message-ID: On Mon, 27 Feb 2023 17:04:28 GMT, Saint Wesonga wrote: > There are 2 primary differences between the Windows ARM64 ABI and the macOS/Linux ARM64 ABI: variadic floating point arguments are passed in general purpose registers on Windows (instead of the vector registers). In addition to this, up to 64 bytes of a struct being passed to a variadic function can be placed in general purpose registers. This happens regardless of the type of struct (HFA or other generic struct). This means that a struct can be split across registers and the stack when invoking a variadic function. The Windows ARM64 ABI conventions are documented at https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions > > For details about the Foreign Function & Memory API, see JEP 434 at https://openjdk.org/jeps/434 > > This change is a cherry pick of https://github.com/openjdk/panama-foreign/commit/d379ca1c and https://github.com/openjdk/panama-foreign/commit/08225e4f from https://github.com/openjdk/panama-foreign/pull/754 and includes an additional commit that introduces a VaList implementation for Windows on AArch64. This pull request has now been integrated. Changeset: fb130639 Author: Saint Wesonga Committer: Jorn Vernee URL: https://git.openjdk.org/jdk/commit/fb1306394368bdfe3ccfe4980c663d0a56b4a643 Stats: 2135 lines in 20 files changed: 1445 ins; 650 del; 40 mod 8303409: Add Windows AArch64 ABI support to the Foreign Function & Memory API Reviewed-by: jvernee ------------- PR: https://git.openjdk.org/jdk/pull/12773 From thartmann at openjdk.org Thu Mar 2 13:38:13 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 2 Mar 2023 13:38:13 GMT Subject: RFR: 8143900: OptimizeStringConcat has an opaque dependency on Integer.sizeTable field [v3] In-Reply-To: References: Message-ID: On Mon, 27 Feb 2023 02:16:21 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? I noticed a strange field Integer.sizeTable which is used by PhaseStringOpts, after digging into the history, I think it could be replaced by an in-place array allocation and initialization. Before it, we are fetching from Integer.sizeTable and get num of digit in integer by iterating size table, now we fetch from in-place sizeTable and get size from that. The changed IR looks like this: >> >> ![image](https://user-images.githubusercontent.com/5010047/220239446-7b8b8381-b300-4f2c-a24a-aa19ec9e2f88.png) >> >> Thanks. > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > comment from review feedback Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12680 From rcastanedalo at openjdk.org Thu Mar 2 14:01:11 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 2 Mar 2023 14:01:11 GMT Subject: RFR: JDK-8303443: IGV: Syntax highlighting and resizing for filter editor In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 12:14:07 GMT, Tobias Holenstein wrote: > In the Filter window of the IdealGraphVisualizer (IGV) the user can double-click on a filter to edit the javascript code. > > - Previously, the code window was not resizable and had no syntax highlighting > editor_old > > - Now, the code window can be resized by the user and has basic syntax highlighting for `keywords`, `strings` and `comments` > editor_new > > - Further all filter are now saved as .js files in `src/utils/IdealGraphVisualizer/application/target/userdir/config/Filters` and reloaded when opening a new IGV instance > js_ext Thanks for improving this area of IGV! I found a few issues in the changeset: 1. The proposed font ("Courier") is not very readable on my system (Ubuntu Linux): ![courier](https://user-images.githubusercontent.com/8792647/222445180-40f46bee-90e3-4b48-9d24-9571e1c4e5e8.png) I suggest using the logical font "Monospaced" instead [for portability](https://docs.oracle.com/en/java/javase/19/docs/api/java.desktop/java/awt/Font.html): ![monospaced](https://user-images.githubusercontent.com/8792647/222446352-49633e0f-db7c-445d-b5ae-6335b11eb95c.png) 2. The changeset edits the generated code area of `EditFilterDialog.java`, making it out of sync with `EditFilterDialog.form`. Would it be possible to change that part of `EditFilterDialog.java` using NetBeans' GUI Builder instead? (so that `EditFilterDialog.form` stays on sync). 3. When I run IGV, make some change in a filter, close IGV, then re-run it again, I end up with the same filters loaded multiple times: ![filters](https://user-images.githubusercontent.com/8792647/222448902-47a6999e-0b4a-4739-9863-3da0aaf41b86.png) ------------- Changes requested by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/12803 From epeter at openjdk.org Thu Mar 2 15:55:51 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 15:55:51 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v18] In-Reply-To: References: Message-ID: > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. 
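For concreteness, a minimal editorial sketch (not one of the regression tests linked above) of a loop with such a distance-2 dependency; vectorizing it naively produces the "shift by 2" behaviour instead of repeating the first two values:

```java
static void kernel(int[] a) {
    for (int i = 0; i < a.length - 2; i++) {
        a[i + 2] = a[i];   // the store feeds the load two iterations later: a distance-2 cycle
    }
}
// Scalar semantics copy a[0] and a[1] across the whole array. A naive 4-lane vector
// copy instead loads a[0..3] and stores them to a[2..5], i.e. it shifts the contents
// by 2, because lanes 2 and 3 ignore the values just written by lanes 0 and 1.
```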
> > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. 
With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Reworked TestDependencyOffsets.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/9b8738ae..645ed502 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=16-17 Stats: 5816 lines in 1 file changed: 1491 ins; 1701 del; 2624 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Thu Mar 2 16:00:44 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 16:00:44 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v19] In-Reply-To: References: Message-ID: > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. 
If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 33 additional commits since the last revision: - Merge branch 'master' into JDK-8298935 - Reworked TestDependencyOffsets.java - remove negative IR rules for TestOptionVectorizeIR.java - removed negative rules for TestCyclicDependency.java - TestDependencyOffsets.java: MulVL not supported on NEON / asimd. Replaced it with AddVL - Fix TestOptionVectorizeIR.java for aarch64 machines with AlignVector == true - Fix TestCyclicDependency.java for aarch64 machines with AlignVector == true - v2 TestDependencyOffsets.java based on MaxVectorSize not SuperWordMaxVectorSize (platform independent) - Version 1 of script-generated offset dependency test - Merge branch 'master' into JDK-8298935 - ... 
and 23 more: https://git.openjdk.org/jdk/compare/cb2bd7ae...199fcf0a ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/645ed502..199fcf0a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=18 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=17-18 Stats: 13256 lines in 453 files changed: 8478 ins; 2727 del; 2051 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Thu Mar 2 16:00:44 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 16:00:44 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v9] In-Reply-To: References: <8lyMxPlTmzLDuwFLvhla8AAL6sVxEjigf4LaDlsAVWg=.4bfb5428-4b11-4c65-8540-503cf9e595f1@github.com> <_eDa4Xjs3HIxm2u4bUOWmmZL6y2DjQGd9kVyrUCTSD0=.d56d4533-bf46-4900-8c7c-6e22d247ff22@github.com> Message-ID: <9bFwAtzXSc26dTNribicwtgQTa5GENR0Pwl0xUA1v2A=.2aa36d32-5976-4f75-a0d1-44227fc29168@github.com> On Thu, 2 Mar 2023 07:44:16 GMT, Emanuel Peter wrote: >>> @jatin-bhateja I now have a first version out of the `test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java`. It seems to work for `sse4.1 ... avx512`. I'm now testing it for `asimd`. And then I will proceed to add the features for `-XX:+AlignVector`, with the modulo check. >> >> I am seeing lots of IR violations with UseSSE=4. > > I realized I have lots of negative IR rules that check that we do NOT vectorize if I expect cyclic dependency. But these negative rules are difficult, there may always be some other factor that leads to shorter vector sizes than what I expect. And then it vectorizes, and does not encounter a cyclic dependency. So I will have to remove all these negative IR rules. > > @jatin-bhateja was there any positive IR rule that failed? One that did expect vectorization, but it did not in fact vectorize? I now removed all such negative IR rules. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Thu Mar 2 16:00:47 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 2 Mar 2023 16:00:47 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: On Thu, 2 Mar 2023 07:09:46 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java line 881: >> >>> 879: // cpu: sse4.1 to avx -> vector_width: 16 -> elements in vector: 4 >>> 880: // positive byte_offset 4 can lead to cyclic dependency >>> 881: @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, >> >> Needs to fix the test generation script. This IR rule looks incorrect since there is a valid dependency with distance 1. > > @jatin-bhateja I think the IR rule is just ineffective. I have the following condition in it that will never be met: > `applyIfAnd = {"MaxVectorSize", ">= 8", "MaxVectorSize", "<= 4"},` > The `<= 4` must hold so that `byte_offset <= MaxVectorSize`, and so the cyclical dependency would not happen. But `>= 8` must hold so that two ints fit in a vector, so that we even vectorize. > > I could improve the script and filter out such ineffective IR rules. Not sure if that is worth it though. 
I fixed my script, it should now compute the ranges correctly, and not add IR rules with impossible ranges. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From kvn at openjdk.org Thu Mar 2 17:33:21 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 2 Mar 2023 17:33:21 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v2] In-Reply-To: References: <5i999_cMhXUrM7TY5N0YGJAsiegwj-SdzuKLGscKaTQ=.4c75e5f3-d9ff-485d-a525-462c02979671@github.com> <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> Message-ID: On Thu, 2 Mar 2023 09:05:28 GMT, Emanuel Peter wrote: > I think the reason we used `same_velt_type` was that we were confused. Or maybe we did that before we had memory slices, and using `same_velt_type` was at least already an improvemnt? At any rate: it was confused and leads to Bugs in conjunction with `Unsafe`, as my example showed. I did not consider using memory slices (or unsafe access) when worked on this code. Same element type was easy choice for this check. > I also don't think we should spend too much time making sure every possible combinations of unsafe accesses optimize well or even correctly if it's too much work. Once people start using unsafe, they are on their own. I think we should stick with whatever feels reasonable or is used in the core libraries (hopefully the second category is included in the first category). Yes, even without vectorization we can construct a Java test with Unsafe access which has cyclic dependencies and overwrite elements intentionally or by mistake. I remember one such case in system libraries several years ago which was fixed. JIT optimization should not introduce wrong behavior if Java code does not have it. But if we can correctly detect and reject cyclic dependency we can vectorize it. Original Roland's example and last Emanuel's `StoreI(farr, i)` example don't have "bad" cyclic dependency - at worst they store the same values to the same elements. So it is all about cyclic dependency detection and assumption that we may accessing the same array. ------------- PR: https://git.openjdk.org/jdk/pull/12440 From kvn at openjdk.org Thu Mar 2 17:36:22 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 2 Mar 2023 17:36:22 GMT Subject: RFR: 8303511: C2: assert(get_ctrl(n) == cle_out) during unrolling In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 09:24:17 GMT, Roland Westrelin wrote: > In the same round of loop optimizations: > > - `PhaseIdealLoop::remix_address_expressions()` creates a new `AddP` > out of loop. It sets it control to > `n_loop->_head->in(LoopNode::EntryControl)` which, because the loop is > strip mined, is an `OuterStripMinedLoop`. > > - The `LoadI` for that `AddP` is found to only have uses outside the > loop and is cloned out of the loop. It's referenced by the outer > loop's safepoint. > > - The loop is unrolled. Unrolling follows the safepoint's inputs and > find the new `AddP` with control set to the `OuterStripMinedLoop` > and the assert fires. > > No control should be set to an `OuterStripMinedLoop`. The fix is > straightforward and sets the control to the `OuterStripMinedLoop` > entry control. Yes, this looks good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12824 From kvn at openjdk.org Thu Mar 2 17:50:07 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 2 Mar 2023 17:50:07 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v4] In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 09:38:56 GMT, Tobias Hartmann wrote: >> C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. >> >> The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: >> >> ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) >> >> It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. >> >> With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: >> >> ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) >> >> With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). >> >> The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: >> https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 >> >> Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. >> >> The fix is to move `set_default_node_notes` down to after `do_exits`. >> >> I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. 
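As an illustration of the `do_exits` scenario described above (an editorial sketch, not the actual `testFinalFieldInit` added by the patch): a constructor that initializes a final field gets a trailing release barrier emitted when the method is exited, and it is such late-emitted nodes whose attached `JVMState` used to point at the caller:

```java
class Holder {
    final int value;

    Holder(int value) {
        this.value = value;   // final-field store: a MemBarRelease is emitted at constructor exit
    }
}

static Holder newHolder() {
    // The barrier node created in do_exits() previously carried the caller's JVMState,
    // so -XX:+DebugNonSafepoints attributed the instruction to the caller's method entry.
    return new Holder(42);
}
```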
>> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Presize new node note array src/hotspot/share/opto/phaseX.cpp line 474: > 472: GrowableArray* old_node_note_array = C->node_note_array(); > 473: if (old_node_note_array != nullptr) { > 474: int new_size = (_useful.size() >> 8) + 1; // The node note array uses blocks, see C->_log2_node_notes_block_size You should call `new_size = MAX2(8, new_size)` to make sure that we have at least 8 elements for initial allocation. ------------- PR: https://git.openjdk.org/jdk/pull/12806 From aturbanov at openjdk.org Thu Mar 2 18:23:13 2023 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Thu, 2 Mar 2023 18:23:13 GMT Subject: RFR: 8303511: C2: assert(get_ctrl(n) == cle_out) during unrolling In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 09:24:17 GMT, Roland Westrelin wrote: > In the same round of loop optimizations: > > - `PhaseIdealLoop::remix_address_expressions()` creates a new `AddP` > out of loop. It sets it control to > `n_loop->_head->in(LoopNode::EntryControl)` which, because the loop is > strip mined, is an `OuterStripMinedLoop`. > > - The `LoadI` for that `AddP` is found to only have uses outside the > loop and is cloned out of the loop. It's referenced by the outer > loop's safepoint. > > - The loop is unrolled. Unrolling follows the safepoint's inputs and > find the new `AddP` with control set to the `OuterStripMinedLoop` > and the assert fires. > > No control should be set to an `OuterStripMinedLoop`. The fix is > straightforward and sets the control to the `OuterStripMinedLoop` > entry control. test/hotspot/jtreg/compiler/loopstripmining/TestAddPAtOuterLoopHead.java line 65: > 63: > 64: > 65: int v = 0; Suggestion: int v = 0; ------------- PR: https://git.openjdk.org/jdk/pull/12824 From xgong at openjdk.org Fri Mar 3 01:22:04 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 3 Mar 2023 01:22:04 GMT Subject: RFR: 8302830: AArch64: Fix the mismatch between cas.m4 and aarch64.ad [v2] In-Reply-To: References: Message-ID: On Tue, 21 Feb 2023 05:34:42 GMT, Hao Sun wrote: >> Fix the following mismatch between cas.m4 and aarch64.ad. >> >> >> $ m4 cas.m4 > cas.gen.ad >> $ sed '8930,9404!d' aarch64.ad > res.ad >> $ diff -uN cas.gen.ad res.ad | cat -A >> >> --- cas.gen.ad^I2023-02-20 04:18:46.624289978 +0000$ >> +++ res.ad^I2023-02-20 04:19:08.780351888 +0000$ >> @@ -15,7 +15,7 @@$ >> // This pattern is generated automatically from cas.m4.$ >> // DO NOT EDIT ANYTHING IN THIS SECTION OF THE FILE$ >> instruct compareAndExchangeB(iRegINoSp res, indirect mem, iRegI oldval, iRegI newval, rFlagsReg cr) %{$ >> - $ >> +$ >> match(Set res (CompareAndExchangeB mem (Binary oldval newval)));$ >> ins_cost(2 * VOLATILE_REF_COST);$ >> effect(TEMP_DEF res, KILL cr);$ >> ... >> >> >> Besides, update the comment since "aarch_ad_cas.m4" was renamed to "cas.m4". > > Hao Sun has updated the pull request incrementally with one additional commit since the last revision: > > Remove empty lines in AD files > > Co-Developed-by: aph LGTM! ------------- Marked as reviewed by xgong (Committer). 
PR: https://git.openjdk.org/jdk/pull/12647 From haosun at openjdk.org Fri Mar 3 01:27:19 2023 From: haosun at openjdk.org (Hao Sun) Date: Fri, 3 Mar 2023 01:27:19 GMT Subject: RFR: 8302830: AArch64: Fix the mismatch between cas.m4 and aarch64.ad In-Reply-To: References: Message-ID: On Mon, 20 Feb 2023 12:57:46 GMT, Andrew Haley wrote: >> Fix the following mismatch between cas.m4 and aarch64.ad. >> >> >> $ m4 cas.m4 > cas.gen.ad >> $ sed '8930,9404!d' aarch64.ad > res.ad >> $ diff -uN cas.gen.ad res.ad | cat -A >> >> --- cas.gen.ad^I2023-02-20 04:18:46.624289978 +0000$ >> +++ res.ad^I2023-02-20 04:19:08.780351888 +0000$ >> @@ -15,7 +15,7 @@$ >> // This pattern is generated automatically from cas.m4.$ >> // DO NOT EDIT ANYTHING IN THIS SECTION OF THE FILE$ >> instruct compareAndExchangeB(iRegINoSp res, indirect mem, iRegI oldval, iRegI newval, rFlagsReg cr) %{$ >> - $ >> +$ >> match(Set res (CompareAndExchangeB mem (Binary oldval newval)));$ >> ins_cost(2 * VOLATILE_REF_COST);$ >> effect(TEMP_DEF res, KILL cr);$ >> ... >> >> >> Besides, update the comment since "aarch_ad_cas.m4" was renamed to "cas.m4". > > Whitespace is all messed up. try this: > > [foo.zip](https://github.com/openjdk/jdk/files/10784530/foo.zip) Thanks for your review, @theRealAph and @XiaohongGong ! I don't think the GAH failure, i.e. linux-cross-compile / build(arm), is related to this patch. ------------- PR: https://git.openjdk.org/jdk/pull/12647 From haosun at openjdk.org Fri Mar 3 01:27:20 2023 From: haosun at openjdk.org (Hao Sun) Date: Fri, 3 Mar 2023 01:27:20 GMT Subject: Integrated: 8302830: AArch64: Fix the mismatch between cas.m4 and aarch64.ad In-Reply-To: References: Message-ID: <0zoGQOOSs-nma2fbzxMjIG5EwL97g7vSHprBJs_e8IE=.f9d9540b-fc2f-4641-917f-ee64841da8b1@github.com> On Mon, 20 Feb 2023 07:30:10 GMT, Hao Sun wrote: > Fix the following mismatch between cas.m4 and aarch64.ad. > > > $ m4 cas.m4 > cas.gen.ad > $ sed '8930,9404!d' aarch64.ad > res.ad > $ diff -uN cas.gen.ad res.ad | cat -A > > --- cas.gen.ad^I2023-02-20 04:18:46.624289978 +0000$ > +++ res.ad^I2023-02-20 04:19:08.780351888 +0000$ > @@ -15,7 +15,7 @@$ > // This pattern is generated automatically from cas.m4.$ > // DO NOT EDIT ANYTHING IN THIS SECTION OF THE FILE$ > instruct compareAndExchangeB(iRegINoSp res, indirect mem, iRegI oldval, iRegI newval, rFlagsReg cr) %{$ > - $ > +$ > match(Set res (CompareAndExchangeB mem (Binary oldval newval)));$ > ins_cost(2 * VOLATILE_REF_COST);$ > effect(TEMP_DEF res, KILL cr);$ > ... > > > Besides, update the comment since "aarch_ad_cas.m4" was renamed to "cas.m4". This pull request has now been integrated. Changeset: 35003b5f Author: Hao Sun URL: https://git.openjdk.org/jdk/commit/35003b5f7b341d7abd932fc4c795797960321369 Stats: 33 lines in 2 files changed: 6 ins; 12 del; 15 mod 8302830: AArch64: Fix the mismatch between cas.m4 and aarch64.ad Reviewed-by: aph, xgong ------------- PR: https://git.openjdk.org/jdk/pull/12647 From yyang at openjdk.org Fri Mar 3 02:04:21 2023 From: yyang at openjdk.org (Yi Yang) Date: Fri, 3 Mar 2023 02:04:21 GMT Subject: RFR: 8143900: OptimizeStringConcat has an opaque dependency on Integer.sizeTable field [v2] In-Reply-To: References: Message-ID: On Mon, 27 Feb 2023 18:45:44 GMT, Vladimir Kozlov wrote: >>> Testing results seem good. Except one strange failure I put in [JDK-8143900](https://bugs.openjdk.org/browse/JDK-8143900) comment. You need second review. >> >> This seems related to https://bugs.openjdk.org/browse/JDK-8296914 > >> > Testing results seem good. 
Except one strange failure I put in [JDK-8143900](https://bugs.openjdk.org/browse/JDK-8143900) comment. You need second review. >> >> This seems related to https://bugs.openjdk.org/browse/JDK-8296914 > > [JDK-8296914 ](https://bugs.openjdk.org/browse/JDK-8296914)was closed as duplicate of still opened [JDK-8270202](https://bugs.openjdk.org/browse/JDK-8270202) Thanks @vnkozlov @TobiHartmann for reviews. ------------- PR: https://git.openjdk.org/jdk/pull/12680 From yyang at openjdk.org Fri Mar 3 02:04:23 2023 From: yyang at openjdk.org (Yi Yang) Date: Fri, 3 Mar 2023 02:04:23 GMT Subject: Integrated: 8143900: OptimizeStringConcat has an opaque dependency on Integer.sizeTable field In-Reply-To: References: Message-ID: On Tue, 21 Feb 2023 02:29:44 GMT, Yi Yang wrote: > Hi, can I have a review for this patch? I noticed a strange field Integer.sizeTable which is used by PhaseStringOpts, after digging into the history, I think it could be replaced by an in-place array allocation and initialization. Before it, we are fetching from Integer.sizeTable and get num of digit in integer by iterating size table, now we fetch from in-place sizeTable and get size from that. The changed IR looks like this: > > ![image](https://user-images.githubusercontent.com/5010047/220239446-7b8b8381-b300-4f2c-a24a-aa19ec9e2f88.png) > > Thanks. This pull request has now been integrated. Changeset: c961a918 Author: Yi Yang URL: https://git.openjdk.org/jdk/commit/c961a918ad41a78ec15389837abf29c98d66792f Stats: 192 lines in 3 files changed: 49 ins; 106 del; 37 mod 8143900: OptimizeStringConcat has an opaque dependency on Integer.sizeTable field Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12680 From duke at openjdk.org Fri Mar 3 02:24:51 2023 From: duke at openjdk.org (Chang Peng) Date: Fri, 3 Mar 2023 02:24:51 GMT Subject: RFR: 8297753: AArch64: Add optimized rules for vector compare with zero on NEON [v7] In-Reply-To: References: Message-ID: > We can use the compare-with-zero instructions like cmgt(zero)[1] immediately to avoid the extra scalar2vector operations. > > The following instruction sequence > > movi v16.4s, #0x0 > cmgt v16.4s, v17.4s, v16.4s > > can be optimized to: > > cmgt v16.4s, v17.4s, #0x0 > > This patch does the following: > 1. Add NEON floating-point compare-with-zero instructions. > 2. Add optimized match rules to generate the compare-with-zero instructions. > > [1]: https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/CMGT--zero---Compare-signed-Greater-than-zero--vector-- Chang Peng has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: - Merge branch 'openjdk:master' into add_cmp0_neon - Remove some switch-case stmts in c2_MacroAssembler_aarch64.cpp and avoid unsigned comparison. - Revert "Remove switch-case stmts in c2_MacroAssembler_aarch64.cpp" This reverts commit d899238d0cb98fdf375b3011670495c3bfe8bbaf. - Merge branch 'openjdk:master' into add_cmp0_neon - Remove switch-case stmts in c2_MacroAssembler_aarch64.cpp - Merge fcm instruction encoding functions into a single function. - Merge branch 'openjdk:master' into add_cmp0_neon - Resolving the merge conflicts caused by test/hotspot/gtest/aarch64/asmtest.out.h Change-Id: I896b879c8b7097a99e35fc1e53abab646240281a - 8297753: AArch64: Add optimized rules for vector compare with zero on NEON We can use the compare-with-zero instructions like cmgt(zero)[1] immediately to avoid the extra scalar2vector operations. 
The following instruction sequence ``` movi v16.4s, #0x0 cmgt v16.4s, v17.4s, v16.4s ``` can be optimized to: ``` cmgt v16.4s, v17.4s, #0x0 ``` This patch does the following: 1. Add NEON floating-point compare-with-zero instructions. 2. Add optimized match rules to generate the compare-with-zero instructions. [1]: https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/CMGT--zero---Compare-signed-Greater-than-zero--vector-- Change-Id: If026b477a0cad809bd201feafbfc9ab301a1b569 ------------- Changes: https://git.openjdk.org/jdk/pull/11822/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11822&range=06 Stats: 1033 lines in 10 files changed: 535 ins; 0 del; 498 mod Patch: https://git.openjdk.org/jdk/pull/11822.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11822/head:pull/11822 PR: https://git.openjdk.org/jdk/pull/11822 From duke at openjdk.org Fri Mar 3 02:55:48 2023 From: duke at openjdk.org (Chang Peng) Date: Fri, 3 Mar 2023 02:55:48 GMT Subject: RFR: 8297753: AArch64: Add optimized rules for vector compare with zero on NEON [v8] In-Reply-To: References: Message-ID: > We can use the compare-with-zero instructions like cmgt(zero)[1] immediately to avoid the extra scalar2vector operations. > > The following instruction sequence > > movi v16.4s, #0x0 > cmgt v16.4s, v17.4s, v16.4s > > can be optimized to: > > cmgt v16.4s, v17.4s, #0x0 > > This patch does the following: > 1. Add NEON floating-point compare-with-zero instructions. > 2. Add optimized match rules to generate the compare-with-zero instructions. > > [1]: https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/CMGT--zero---Compare-signed-Greater-than-zero--vector-- Chang Peng has updated the pull request incrementally with one additional commit since the last revision: Merge cmxx instructions of different conditions into a single function. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11822/files - new: https://git.openjdk.org/jdk/pull/11822/files/32df021f..cd425ef1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11822&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11822&range=06-07 Stats: 82 lines in 6 files changed: 31 ins; 24 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/11822.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11822/head:pull/11822 PR: https://git.openjdk.org/jdk/pull/11822 From duke at openjdk.org Fri Mar 3 03:20:12 2023 From: duke at openjdk.org (Amit Kumar) Date: Fri, 3 Mar 2023 03:20:12 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken In-Reply-To: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Thu, 2 Mar 2023 10:06:17 GMT, Amit Kumar wrote: > This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. @sunny868 As you were the author for original changes, please review this PR as well. Also suggest if you've any better idea to do the same, I would appreciate that. hs_err & replay log files you will find in JBS-issue. 
Thank you ------------- PR: https://git.openjdk.org/jdk/pull/12825 From thartmann at openjdk.org Fri Mar 3 06:37:08 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 3 Mar 2023 06:37:08 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v5] In-Reply-To: References: Message-ID: <4PpM775Vw8MQtZu4vuycM0ZDx8bkqy8TJ_aS4OmRDVE=.63e42077-5ccd-43a8-96af-163f0fe8392b@github.com> > C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. > > The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: > > ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) > > It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. > > With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: > > ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) > > With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). > > The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: > https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 > > Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. > > The fix is to move `set_default_node_notes` down to after `do_exits`. > > I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. 
> > Thanks, > Tobias Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: Use MAX2 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12806/files - new: https://git.openjdk.org/jdk/pull/12806/files/f1bc8db4..5e22c2cb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12806&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12806&range=03-04 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12806.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12806/head:pull/12806 PR: https://git.openjdk.org/jdk/pull/12806 From thartmann at openjdk.org Fri Mar 3 06:37:12 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 3 Mar 2023 06:37:12 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v4] In-Reply-To: References: Message-ID: <0gI6DIHtc7F63CFYoccotGQv-BHYadPRW0liqEQvh6Q=.58a774ec-7591-4f44-aa7f-7755593ac04e@github.com> On Thu, 2 Mar 2023 17:47:17 GMT, Vladimir Kozlov wrote: >> Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: >> >> Presize new node note array > > src/hotspot/share/opto/phaseX.cpp line 474: > >> 472: GrowableArray* old_node_note_array = C->node_note_array(); >> 473: if (old_node_note_array != nullptr) { >> 474: int new_size = (_useful.size() >> 8) + 1; // The node note array uses blocks, see C->_log2_node_notes_block_size > > You should call `new_size = MAX2(8, new_size)` to make sure that we have at least 8 elements for initial allocation. Okay, I added that. The 8 seems arbitrary to me but since we already use that for initial allocation of the array, we can as well be consistent here. Just note that since we are calling `C->grow_node_notes`, we will also initialize with `Node_Notes*` right away. ------------- PR: https://git.openjdk.org/jdk/pull/12806 From epeter at openjdk.org Fri Mar 3 07:37:00 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 3 Mar 2023 07:37:00 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v20] In-Reply-To: References: Message-ID: > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. 
Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? 
[JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: TestOptionVectorizeIR.java: removed PopulateIndex IR rule - fails on x86 32bit - see Matcher::match_rule_supported ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/199fcf0a..fb7f6dd9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=19 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=18-19 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From duke at openjdk.org Fri Mar 3 07:46:57 2023 From: duke at openjdk.org (Chang Peng) Date: Fri, 3 Mar 2023 07:46:57 GMT Subject: RFR: 8297753: AArch64: Add optimized rules for vector compare with zero on NEON [v9] In-Reply-To: References: Message-ID: > We can use the compare-with-zero instructions like cmgt(zero)[1] immediately to avoid the extra scalar2vector operations. > > The following instruction sequence > > movi v16.4s, #0x0 > cmgt v16.4s, v17.4s, v16.4s > > can be optimized to: > > cmgt v16.4s, v17.4s, #0x0 > > This patch does the following: > 1. Add NEON floating-point compare-with-zero instructions. > 2. Add optimized match rules to generate the compare-with-zero instructions. > > [1]: https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/CMGT--zero---Compare-signed-Greater-than-zero--vector-- Chang Peng has updated the pull request incrementally with one additional commit since the last revision: Remove hard-coded 0b1111 in to_assembler_condition() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11822/files - new: https://git.openjdk.org/jdk/pull/11822/files/cd425ef1..1d1458c0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11822&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11822&range=07-08 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11822.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11822/head:pull/11822 PR: https://git.openjdk.org/jdk/pull/11822 From aph at openjdk.org Fri Mar 3 09:38:06 2023 From: aph at openjdk.org (Andrew Haley) Date: Fri, 3 Mar 2023 09:38:06 GMT Subject: RFR: 8297753: AArch64: Add optimized rules for vector compare with zero on NEON [v9] In-Reply-To: References: Message-ID: <13_9PNDgn4MlJFUojrJuiKBgv-sS_W4JXsjzTNBHpQ0=.f2058238-ce44-4e46-a3ab-e2f87897363f@github.com> On Fri, 3 Mar 2023 07:46:57 GMT, Chang Peng wrote: >> We can use the compare-with-zero instructions like cmgt(zero)[1] immediately to avoid the extra scalar2vector operations. >> >> The following instruction sequence >> >> movi v16.4s, #0x0 >> cmgt v16.4s, v17.4s, v16.4s >> >> can be optimized to: >> >> cmgt v16.4s, v17.4s, #0x0 >> >> This patch does the following: >> 1. Add NEON floating-point compare-with-zero instructions. >> 2. Add optimized match rules to generate the compare-with-zero instructions. >> >> [1]: https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/CMGT--zero---Compare-signed-Greater-than-zero--vector-- > > Chang Peng has updated the pull request incrementally with one additional commit since the last revision: > > Remove hard-coded 0b1111 in to_assembler_condition() Alright! That is _beautiful_. 
I felt a bit bad about pushing you so hard on this, but I think the quality of the result justifies the effort. I hope you agree. Thank you. ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.org/jdk/pull/11822 From duke at openjdk.org Fri Mar 3 09:58:28 2023 From: duke at openjdk.org (Chang Peng) Date: Fri, 3 Mar 2023 09:58:28 GMT Subject: RFR: 8297753: AArch64: Add optimized rules for vector compare with zero on NEON [v9] In-Reply-To: <13_9PNDgn4MlJFUojrJuiKBgv-sS_W4JXsjzTNBHpQ0=.f2058238-ce44-4e46-a3ab-e2f87897363f@github.com> References: <13_9PNDgn4MlJFUojrJuiKBgv-sS_W4JXsjzTNBHpQ0=.f2058238-ce44-4e46-a3ab-e2f87897363f@github.com> Message-ID: On Fri, 3 Mar 2023 09:35:33 GMT, Andrew Haley wrote: > Alright! That is _beautiful_. > > I felt a bit bad about pushing you so hard on this, but I think the quality of the result justifies the effort. I hope you agree. > > Thank you. Thanks for your review. ------------- PR: https://git.openjdk.org/jdk/pull/11822 From roland at openjdk.org Fri Mar 3 10:31:09 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 3 Mar 2023 10:31:09 GMT Subject: RFR: 8303564: C2: "Bad graph detected in build_loop_late" after a CMove is wrongly split thru phi Message-ID: The following steps lead to the crash: - In `testHelper()`, the null and range checks for the `field1[0]` load are hoisted out of the counted loop by loop predication - As a result, the `field1[0]` load is also out of loop, control dependent on a predicate - pre/main/post loops are created, the main loop is unrolled, the `f` value that's stored in `field3` is a Phi that merges the values out of the 3 loops. - the `stop` variable that captures the limit of the loop is transformed into a `Phi` that merges 1 and 2. - As a result, the Phi that's stored in `field3` now only merges the value of the pre and post loop and is transformed into a `CmoveI` that merges 2 values dependent on the `field1[0]` `LoadI` that's control dependent on a predicate. - On the next round of loop opts, the `CmoveI` is assigned control below the predicate but the `Bool`/`CmpI` for the `CmoveI` is assigned control above, right below a `Region` that has a `Phi` that is input to the `CmpI`. The reason is this logic: https://github.com/rwestrel/jdk/blob/99f5687eb192b249a4a4533578f56b131fb8f234/src/hotspot/share/opto/loopnode.cpp#L5968 - The `CmoveI` is split thru phi because the `Bool`/`CmpI` have control right below a `Region`. That shouldn't happen because the `CmoveI` itself doesn't have control at the `Region` and is actually pinned through the `LoadI` below the `Region`. The fix I propose is to check the control of the `CmoveI` before proceding with split thru phi. 
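A hypothetical sketch of the shape of code being described (the field and variable names follow the description above, but this is not the actual regression test from the PR):

```java
static int[] field1 = {0};
static int field3;

static void testHelper(boolean flag) {
    int stop = flag ? 1 : 2;        // later becomes a Phi merging 1 and 2
    int f = 0;
    for (int i = 0; i < stop; i++) {
        // Loop predication hoists the null and range checks (and with them the load)
        // out of the loop, leaving the load control dependent on a predicate.
        f = field1[0];
    }
    field3 = f;                     // the Phi merging f out of the pre/post loops turns into a CMoveI
}
```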
------------- Commit messages: - fix Changes: https://git.openjdk.org/jdk/pull/12851/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12851&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303564 Stats: 82 lines in 2 files changed: 80 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12851.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12851/head:pull/12851 PR: https://git.openjdk.org/jdk/pull/12851 From mcimadamore at openjdk.org Fri Mar 3 11:10:22 2023 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Fri, 3 Mar 2023 11:10:22 GMT Subject: RFR: 8303409: Add Windows AArch64 ABI support to the Foreign Function & Memory API In-Reply-To: References: Message-ID: On Mon, 27 Feb 2023 17:04:28 GMT, Saint Wesonga wrote: > There are 2 primary differences between the Windows ARM64 ABI and the macOS/Linux ARM64 ABI: variadic floating point arguments are passed in general purpose registers on Windows (instead of the vector registers). In addition to this, up to 64 bytes of a struct being passed to a variadic function can be placed in general purpose registers. This happens regardless of the type of struct (HFA or other generic struct). This means that a struct can be split across registers and the stack when invoking a variadic function. The Windows ARM64 ABI conventions are documented at https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions > > For details about the Foreign Function & Memory API, see JEP 434 at https://openjdk.org/jeps/434 > > This change is a cherry pick of https://github.com/openjdk/panama-foreign/commit/d379ca1c and https://github.com/openjdk/panama-foreign/commit/08225e4f from https://github.com/openjdk/panama-foreign/pull/754 and includes an additional commit that introduces a VaList implementation for Windows on AArch64. Many thanks for looking into this port! ------------- PR: https://git.openjdk.org/jdk/pull/12773 From duke at openjdk.org Fri Mar 3 12:14:40 2023 From: duke at openjdk.org (changpeng1997) Date: Fri, 3 Mar 2023 12:14:40 GMT Subject: Integrated: 8297753: AArch64: Add optimized rules for vector compare with zero on NEON In-Reply-To: References: Message-ID: On Tue, 3 Jan 2023 08:24:50 GMT, changpeng1997 wrote: > We can use the compare-with-zero instructions like cmgt(zero)[1] immediately to avoid the extra scalar2vector operations. > > The following instruction sequence > > movi v16.4s, #0x0 > cmgt v16.4s, v17.4s, v16.4s > > can be optimized to: > > cmgt v16.4s, v17.4s, #0x0 > > This patch does the following: > 1. Add NEON floating-point compare-with-zero instructions. > 2. Add optimized match rules to generate the compare-with-zero instructions. > > [1]: https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/CMGT--zero---Compare-signed-Greater-than-zero--vector-- This pull request has now been integrated. 
Changeset: d23a8bfb Author: changpeng1997 Committer: Andrew Dinn URL: https://git.openjdk.org/jdk/commit/d23a8bfb14037460731fb6ca1890b03278b84b1a Stats: 1054 lines in 11 files changed: 548 ins; 6 del; 500 mod 8297753: AArch64: Add optimized rules for vector compare with zero on NEON Reviewed-by: aph ------------- PR: https://git.openjdk.org/jdk/pull/11822 From roland at openjdk.org Fri Mar 3 13:50:40 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 3 Mar 2023 13:50:40 GMT Subject: RFR: 8303511: C2: assert(get_ctrl(n) == cle_out) during unrolling [v2] In-Reply-To: References: Message-ID: > In the same round of loop optimizations: > > - `PhaseIdealLoop::remix_address_expressions()` creates a new `AddP` > out of loop. It sets it control to > `n_loop->_head->in(LoopNode::EntryControl)` which, because the loop is > strip mined, is an `OuterStripMinedLoop`. > > - The `LoadI` for that `AddP` is found to only have uses outside the > loop and is cloned out of the loop. It's referenced by the outer > loop's safepoint. > > - The loop is unrolled. Unrolling follows the safepoint's inputs and > find the new `AddP` with control set to the `OuterStripMinedLoop` > and the assert fires. > > No control should be set to an `OuterStripMinedLoop`. The fix is > straightforward and sets the control to the `OuterStripMinedLoop` > entry control. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/loopstripmining/TestAddPAtOuterLoopHead.java Co-authored-by: Andrey Turbanov ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12824/files - new: https://git.openjdk.org/jdk/pull/12824/files/c2787438..f20e29b6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12824&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12824&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12824.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12824/head:pull/12824 PR: https://git.openjdk.org/jdk/pull/12824 From roland at openjdk.org Fri Mar 3 13:50:43 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 3 Mar 2023 13:50:43 GMT Subject: RFR: 8303511: C2: assert(get_ctrl(n) == cle_out) during unrolling [v2] In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 18:20:05 GMT, Andrey Turbanov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/loopstripmining/TestAddPAtOuterLoopHead.java >> >> Co-authored-by: Andrey Turbanov > > test/hotspot/jtreg/compiler/loopstripmining/TestAddPAtOuterLoopHead.java line 65: > >> 63: >> 64: >> 65: int v = 0; > > Suggestion: > > int v = 0; thanks for catching that. ------------- PR: https://git.openjdk.org/jdk/pull/12824 From tholenstein at openjdk.org Fri Mar 3 14:45:04 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 3 Mar 2023 14:45:04 GMT Subject: RFR: JDK-8303443: IGV: Syntax highlighting and resizing for filter editor [v2] In-Reply-To: References: Message-ID: <8oRYadPW4r7i-ljWu2sBv32ATJUKiQ_EmcvbyhtMBTA=.367c18f0-c234-443c-9936-37d24b561824@github.com> On Thu, 2 Mar 2023 13:57:51 GMT, Roberto Casta?eda Lozano wrote: > Thanks for improving this area of IGV! > > I found a few issues in the changeset: > > 1. 
The proposed font ("Courier") is not very readable on my system (Ubuntu Linux): > > ![courier](https://user-images.githubusercontent.com/8792647/222445180-40f46bee-90e3-4b48-9d24-9571e1c4e5e8.png) > > I suggest using the logical font "Monospaced" instead [for portability](https://docs.oracle.com/en/java/javase/19/docs/api/java.desktop/java/awt/Font.html): > > ![monospaced](https://user-images.githubusercontent.com/8792647/222446352-49633e0f-db7c-445d-b5ae-6335b11eb95c.png) > > 2. The changeset edits the generated code area of `EditFilterDialog.java`, making it out of sync with `EditFilterDialog.form`. Would it be possible to change that part of `EditFilterDialog.java` using NetBeans' GUI Builder instead? (so that `EditFilterDialog.form` stays on sync). > 3. When I run IGV, make some change in a filter, close IGV, then re-run it again, I end up with the same filters loaded multiple times: > > ![filters](https://user-images.githubusercontent.com/8792647/222448902-47a6999e-0b4a-4739-9863-3da0aaf41b86.png) @robcasloz thanks for the feedback! I fixed 1) and 2) as suggested. Regarding 3) I decided to remove it from this PR. Better to add this functionality separately ------------- PR: https://git.openjdk.org/jdk/pull/12803 From tholenstein at openjdk.org Fri Mar 3 14:45:00 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 3 Mar 2023 14:45:00 GMT Subject: RFR: JDK-8303443: IGV: Syntax highlighting and resizing for filter editor [v2] In-Reply-To: References: Message-ID: > In the Filter window of the IdealGraphVisualizer (IGV) the user can double-click on a filter to edit the javascript code. > > - Previously, the code window was not resizable and had no syntax highlighting > editor_old > > - Now, the code window can be resized by the user and has basic syntax highlighting for `keywords`, `strings` and `comments` > editor_new Tobias Holenstein has updated the pull request incrementally with four additional commits since the last revision: - revert .js ending for filters - copyright year - form working in Netbeans IDE - Monospaced font ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12803/files - new: https://git.openjdk.org/jdk/pull/12803/files/34169439..4b2dd4a5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12803&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12803&range=00-01 Stats: 18 lines in 3 files changed: 5 ins; 6 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/12803.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12803/head:pull/12803 PR: https://git.openjdk.org/jdk/pull/12803 From roland at openjdk.org Fri Mar 3 15:37:37 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 3 Mar 2023 15:37:37 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v4] In-Reply-To: References: Message-ID: > The loop that doesn't vectorize is: > > > public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { > for (int i = start; i < stop; i++) { > UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); > } > } > > > It's from a micro-benchmark in the panama > repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing > because it finds it cannot properly align the loop and, from the > comment in the code, that: > > > // Can't allow vectorization of unaligned memory accesses with the > // same type since it could be overlapped accesses to the same array. 
> > > The test for "same type" is implemented by looking at the memory > operation type which in this case is overly conservative as the loop > above is reading and writing with long loads/stores but from and to > arrays of different types that can't overlap. Actually, with such > mismatched accesses, it's also likely an incorrect test (reading and > writing could be to the same array with loads/stores that use > different operand size) eventhough I couldn't write a test case that > would trigger an incorrect execution. > > As a fix, I propose implementing the "same type" test by looking at > memory aliases instead. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: more tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12440/files - new: https://git.openjdk.org/jdk/pull/12440/files/c6c09763..f6820c45 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12440&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12440&range=02-03 Stats: 55 lines in 1 file changed: 55 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12440.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12440/head:pull/12440 PR: https://git.openjdk.org/jdk/pull/12440 From roland at openjdk.org Fri Mar 3 15:38:12 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 3 Mar 2023 15:38:12 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v3] In-Reply-To: References: Message-ID: <_Pliz5pE-VLu0n8Vw65SQpfTe2IyLygr4I1hSz6znMw=.5f8c9dab-739c-436a-ab19-150fe7e63fd3@github.com> On Thu, 2 Mar 2023 08:37:49 GMT, Roland Westrelin wrote: >> The loop that doesn't vectorize is: >> >> >> public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { >> for (int i = start; i < stop; i++) { >> UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); >> } >> } >> >> >> It's from a micro-benchmark in the panama >> repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing >> because it finds it cannot properly align the loop and, from the >> comment in the code, that: >> >> >> // Can't allow vectorization of unaligned memory accesses with the >> // same type since it could be overlapped accesses to the same array. >> >> >> The test for "same type" is implemented by looking at the memory >> operation type which in this case is overly conservative as the loop >> above is reading and writing with long loads/stores but from and to >> arrays of different types that can't overlap. Actually, with such >> mismatched accesses, it's also likely an incorrect test (reading and >> writing could be to the same array with loads/stores that use >> different operand size) eventhough I couldn't write a test case that >> would trigger an incorrect execution. >> >> As a fix, I propose implementing the "same type" test by looking at >> memory aliases instead. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: > > - more > - Merge branch 'master' into JDK-8300258 > - comments > - extra test > - more > - Merge branch 'master' into JDK-8300258 > - review > - more > - fix & test I pushed some more tests than don't vectorize. At this point, is there something missing from the change so that it can move forward? I'm fine with waiting for #12350 to integrate and then rebase. 
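For reference, a rough sketch of the alias-based "same type" test described in the summary above (hypothetical helper, not necessarily the code in this change): the idea is to compare the C2 alias classes of the two accesses rather than their operand/velt types, so a long store into a byte[] and a long load from a long[] are seen as touching different memory slices.

// Hypothetical helper: two memory operations are only treated as possibly
// overlapping if they belong to the same C2 alias class (memory slice),
// instead of comparing the operand types of the loads/stores.
static bool same_memory_slice(Compile* C, MemNode* m1, MemNode* m2) {
  return C->get_alias_index(m1->adr_type()) == C->get_alias_index(m2->adr_type());
}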
------------- PR: https://git.openjdk.org/jdk/pull/12440 From roland at openjdk.org Fri Mar 3 15:38:13 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 3 Mar 2023 15:38:13 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v3] In-Reply-To: <_Pliz5pE-VLu0n8Vw65SQpfTe2IyLygr4I1hSz6znMw=.5f8c9dab-739c-436a-ab19-150fe7e63fd3@github.com> References: <_Pliz5pE-VLu0n8Vw65SQpfTe2IyLygr4I1hSz6znMw=.5f8c9dab-739c-436a-ab19-150fe7e63fd3@github.com> Message-ID: On Fri, 3 Mar 2023 15:27:06 GMT, Roland Westrelin wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: >> >> - more >> - Merge branch 'master' into JDK-8300258 >> - comments >> - extra test >> - more >> - Merge branch 'master' into JDK-8300258 >> - review >> - more >> - fix & test > > I pushed some more tests than don't vectorize. > @rwestrel : can you change the name of the Bug to reflect this? Suggestion: `C2: SuperWord alignment analysis must be based on memory slice, not velt type` The bug database is not used only by developers that are familiar with the code. Sometimes people look for a bug matching what they observe (an error message for instance). So I'm not sure it's a good idea to change the bug synopsis that way as it's meaningless to anyone not familiar with the c2 code. ------------- PR: https://git.openjdk.org/jdk/pull/12440 From epeter at openjdk.org Fri Mar 3 15:38:14 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 3 Mar 2023 15:38:14 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v3] In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 08:37:49 GMT, Roland Westrelin wrote: >> The loop that doesn't vectorize is: >> >> >> public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { >> for (int i = start; i < stop; i++) { >> UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); >> } >> } >> >> >> It's from a micro-benchmark in the panama >> repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing >> because it finds it cannot properly align the loop and, from the >> comment in the code, that: >> >> >> // Can't allow vectorization of unaligned memory accesses with the >> // same type since it could be overlapped accesses to the same array. >> >> >> The test for "same type" is implemented by looking at the memory >> operation type which in this case is overly conservative as the loop >> above is reading and writing with long loads/stores but from and to >> arrays of different types that can't overlap. Actually, with such >> mismatched accesses, it's also likely an incorrect test (reading and >> writing could be to the same array with loads/stores that use >> different operand size) eventhough I couldn't write a test case that >> would trigger an incorrect execution. >> >> As a fix, I propose implementing the "same type" test by looking at >> memory aliases instead. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains ten additional commits since the last revision: > > - more > - Merge branch 'master' into JDK-8300258 > - comments > - extra test > - more > - Merge branch 'master' into JDK-8300258 > - review > - more > - fix & test I think `testByteByte2` gets filtered out by the `memory_alignment` check. https://github.com/openjdk/jdk/blob/f6820c45e9648b2e4f0bf1b5529458e3b3ff3cc5/src/hotspot/share/opto/superword.cpp#L646 or here https://github.com/openjdk/jdk/blob/f6820c45e9648b2e4f0bf1b5529458e3b3ff3cc5/src/hotspot/share/opto/superword.cpp#L683-L684 That is to be expected. Otherwise you need to turn on `_do_vector_loop`. ------------- PR: https://git.openjdk.org/jdk/pull/12440 From kvn at openjdk.org Fri Mar 3 15:47:52 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 3 Mar 2023 15:47:52 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v5] In-Reply-To: <4PpM775Vw8MQtZu4vuycM0ZDx8bkqy8TJ_aS4OmRDVE=.63e42077-5ccd-43a8-96af-163f0fe8392b@github.com> References: <4PpM775Vw8MQtZu4vuycM0ZDx8bkqy8TJ_aS4OmRDVE=.63e42077-5ccd-43a8-96af-163f0fe8392b@github.com> Message-ID: On Fri, 3 Mar 2023 06:37:08 GMT, Tobias Hartmann wrote: >> C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. >> >> The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: >> >> ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) >> >> It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. >> >> With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: >> >> ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) >> >> With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). >> >> The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: >> https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 >> >> Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. 
Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. >> >> The fix is to move `set_default_node_notes` down to after `do_exits`. >> >> I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Use MAX2 Good ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12806 From roland at openjdk.org Fri Mar 3 15:56:20 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 3 Mar 2023 15:56:20 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v3] In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 15:32:28 GMT, Emanuel Peter wrote: > I guess that `testByteByte2` gets filtered out by the `memory_alignment` check. Are you suggesting, `testByteByte2` should have an IR rule that checks it doesn't vectorize? It wouldn't be illegal to vectorize while the others are either illegal to vectorize (they wouldn't execute correctly) or they can't be proven to be legal to vectorize with static information. ------------- PR: https://git.openjdk.org/jdk/pull/12440 From dnsimon at openjdk.org Fri Mar 3 15:56:33 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 3 Mar 2023 15:56:33 GMT Subject: RFR: 8279619: [JVMCI] improve EncodedSpeculationReason In-Reply-To: References: Message-ID: On Mon, 13 Feb 2023 12:22:01 GMT, Doug Simon wrote: > This PR enhances `jdk.vm.ci.meta.EncodedSpeculationReason.encode` such that the `groupName` parameter is included in the encoding. This mitigates the possibility of 2 unrelated speculation objects having the same hash which, in turn, mitigates the possibility of missing a speculation based optimization opportunity. Thanks for the review Tom. Longer term, we can improve this API. ------------- PR: https://git.openjdk.org/jdk/pull/12532 From dnsimon at openjdk.org Fri Mar 3 15:56:34 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 3 Mar 2023 15:56:34 GMT Subject: Integrated: 8279619: [JVMCI] improve EncodedSpeculationReason In-Reply-To: References: Message-ID: On Mon, 13 Feb 2023 12:22:01 GMT, Doug Simon wrote: > This PR enhances `jdk.vm.ci.meta.EncodedSpeculationReason.encode` such that the `groupName` parameter is included in the encoding. This mitigates the possibility of 2 unrelated speculation objects having the same hash which, in turn, mitigates the possibility of missing a speculation based optimization opportunity. This pull request has now been integrated. 
Changeset: 80739e11 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/80739e11b52a73d76525f9508e30f8809342e933 Stats: 42 lines in 2 files changed: 40 ins; 0 del; 2 mod 8279619: [JVMCI] improve EncodedSpeculationReason Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/12532 From epeter at openjdk.org Fri Mar 3 16:05:42 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 3 Mar 2023 16:05:42 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v3] In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 15:53:34 GMT, Roland Westrelin wrote: > Are you suggesting, testByteByte2 should have an IR rule that checks it doesn't vectorize? No, I am not suggesting that. I would not add any IR rule. I had overseen that you have already commented the IR rule out, that is why I proceeded to explain why it fails ? But what I would suggest: Add value verification to the tests. Because we may make mistakes with the vectorization and create wrong results. ------------- PR: https://git.openjdk.org/jdk/pull/12440 From epeter at openjdk.org Fri Mar 3 16:05:44 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 3 Mar 2023 16:05:44 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v4] In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 15:37:37 GMT, Roland Westrelin wrote: >> The loop that doesn't vectorize is: >> >> >> public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { >> for (int i = start; i < stop; i++) { >> UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); >> } >> } >> >> >> It's from a micro-benchmark in the panama >> repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing >> because it finds it cannot properly align the loop and, from the >> comment in the code, that: >> >> >> // Can't allow vectorization of unaligned memory accesses with the >> // same type since it could be overlapped accesses to the same array. >> >> >> The test for "same type" is implemented by looking at the memory >> operation type which in this case is overly conservative as the loop >> above is reading and writing with long loads/stores but from and to >> arrays of different types that can't overlap. Actually, with such >> mismatched accesses, it's also likely an incorrect test (reading and >> writing could be to the same array with loads/stores that use >> different operand size) eventhough I couldn't write a test case that >> would trigger an incorrect execution. >> >> As a fix, I propose implementing the "same type" test by looking at >> memory aliases instead. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > more tests For me it is also ok if you integrate before https://github.com/openjdk/jdk/pull/12350. 
------------- PR: https://git.openjdk.org/jdk/pull/12440 From kvn at openjdk.org Fri Mar 3 16:28:05 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 3 Mar 2023 16:28:05 GMT Subject: RFR: 8303564: C2: "Bad graph detected in build_loop_late" after a CMove is wrongly split thru phi In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 10:22:55 GMT, Roland Westrelin wrote: > The following steps lead to the crash: > > - In `testHelper()`, the null and range checks for the `field1[0]` > load are hoisted out of the counted loop by loop predication > > - As a result, the `field1[0]` load is also out of loop, control > dependent on a predicate > > - pre/main/post loops are created, the main loop is unrolled, the `f` > value that's stored in `field3` is a Phi that merges the values out > of the 3 loops. > > - the `stop` variable that captures the limit of the loop is > transformed into a `Phi` that merges 1 and 2. > > - As a result, the Phi that's stored in `field3` now only merges the > value of the pre and post loop and is transformed into a `CmoveI` > that merges 2 values dependent on the `field1[0]` `LoadI` that's > control dependent on a predicate. > > - On the next round of loop opts, the `CmoveI` is assigned control > below the predicate but the `Bool`/`CmpI` for the `CmoveI` is > assigned control above, right below a `Region` that has a `Phi` that > is input to the `CmpI`. The reason is this logic: > https://github.com/rwestrel/jdk/blob/99f5687eb192b249a4a4533578f56b131fb8f234/src/hotspot/share/opto/loopnode.cpp#L5968 > > - The `CmoveI` is split thru phi because the `Bool`/`CmpI` have > control right below a `Region`. That shouldn't happen because the > `CmoveI` itself doesn't have control at the `Region` and is actually > pinned through the `LoadI` below the `Region`. > > The fix I propose is to check the control of the `CmoveI` before > proceding with split thru phi. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12851 From kvn at openjdk.org Fri Mar 3 16:32:18 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 3 Mar 2023 16:32:18 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v4] In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 15:37:37 GMT, Roland Westrelin wrote: >> The loop that doesn't vectorize is: >> >> >> public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { >> for (int i = start; i < stop; i++) { >> UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); >> } >> } >> >> >> It's from a micro-benchmark in the panama >> repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing >> because it finds it cannot properly align the loop and, from the >> comment in the code, that: >> >> >> // Can't allow vectorization of unaligned memory accesses with the >> // same type since it could be overlapped accesses to the same array. >> >> >> The test for "same type" is implemented by looking at the memory >> operation type which in this case is overly conservative as the loop >> above is reading and writing with long loads/stores but from and to >> arrays of different types that can't overlap. Actually, with such >> mismatched accesses, it's also likely an incorrect test (reading and >> writing could be to the same array with loads/stores that use >> different operand size) eventhough I couldn't write a test case that >> would trigger an incorrect execution. 
>> >> As a fix, I propose implementing the "same type" test by looking at >> memory aliases instead. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > more tests I think we discussed this enough. Approved. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12440 From epeter at openjdk.org Fri Mar 3 17:22:11 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 3 Mar 2023 17:22:11 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v4] In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 15:37:37 GMT, Roland Westrelin wrote: >> The loop that doesn't vectorize is: >> >> >> public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { >> for (int i = start; i < stop; i++) { >> UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); >> } >> } >> >> >> It's from a micro-benchmark in the panama >> repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing >> because it finds it cannot properly align the loop and, from the >> comment in the code, that: >> >> >> // Can't allow vectorization of unaligned memory accesses with the >> // same type since it could be overlapped accesses to the same array. >> >> >> The test for "same type" is implemented by looking at the memory >> operation type which in this case is overly conservative as the loop >> above is reading and writing with long loads/stores but from and to >> arrays of different types that can't overlap. Actually, with such >> mismatched accesses, it's also likely an incorrect test (reading and >> writing could be to the same array with loads/stores that use >> different operand size) eventhough I couldn't write a test case that >> would trigger an incorrect execution. >> >> As a fix, I propose implementing the "same type" test by looking at >> memory aliases instead. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > more tests Ok. I am running the testing, include stress testing again for commit 10. Will report back on monday. ------------- PR: https://git.openjdk.org/jdk/pull/12440 From xxinliu at amazon.com Fri Mar 3 18:56:22 2023 From: xxinliu at amazon.com (Liu, Xin) Date: Fri, 3 Mar 2023 18:56:22 +0000 Subject: Update on PEA in C2 (Episode2) Message-ID: Hi, I would like to update what we have done in C2 PEA. This message consists of 3 parts. First of all, allow me to brief our progress. Second, I will cover what we have done recently. Last, I will share our source code and our document. We managed to compile all methods of java.base module using CTW without inliner. We start looking into them with inliner. This is challenging. Not only the compilation units are exponentially bigger, but we also need to track allocation state across method boundary. By fixing a couple of bugs, we can compile 1581 out of 59,000 methods. The remaining methods are not compiled yet because CTW cannot skip an error and keep going. Since last correspondence, we mainly refactored materialization. In our design, PEA clones the object and initializes its fields when a virtual object converts to a materialized object. Of course, we materialize an object when it is about to escape. This is easier to deal with because the bytecodes that may cause escapement are fixed. In contrast, it's hard to anticipate the position when we need to materialize an object at merge points. I refer to this as 'passive materialization'. 
Basically, we coerce a phi of an object to 'materialized'. One predecessor has materialized, so PEA has to materialize it from the other predecessor. AllocateNode is dependent on JVMState. Sometimes, the current JVMState isn't fit to emit AllocateNode at all. There are multiple reasons. 1) sp has been changed. We need reexecute_sp but Parse doesn't support it. 2) there are dead locals so we can't call GraphKit::add_safepoint_edges(). 3) bci isn't where is supposed to generate uncommon_trap. Because PEA emits a cluster of nodes to an arbitrary position and it implies a Throwable exception, we have seen that materialization altered the program semantics by accident. Eg. try-finally below captures 'Throwable'. If we materialize 'attrs' at call lstat0(because it's a native method), we mistakenly wire the exception to the exception handler. static void lstat(UnixPath path, UnixFileAttributes attrs) throws UnixException { try (NativeBuffer buffer = copyToNativeBuffer(path)) { long comp = Blocker.begin(); try { lstat0(buffer.address(), attrs); // materialize attrs here. } finally { // here is a hidden catch(Throwable) block Blocker.end(comp); } } } It requires quite a lot of hacks to overcome those issues. I decide to ensure materialization is position-agnostic. I implemented the following measures for this goal. 1. I copy debug edges and JVMState from the original AllocateNode. 2. use deoptimization rather than the real exception Last, I would like to ask advice. Our leadership asks me to add more transparency and engagement in the project. I also could use the community helps or it looks like I am grinding. What should I do to make this situation better? 1. source-code We have packed source code here. https://github.com/navyxliu/jdk/tree/PEA_beta In particular, PEA/Makefile can drive all regression tests. Eg. make run-ctw-no-inline compile java.base module without inliner. Currently, I merge jdk/master from time to time. Should I check in the source code to 'sandbox repo' of jdk? Or this is fine to share? 2. issue management. I am working on PEA_Parser branch. https://github.com/navyxliu/jdk/tree/PEA_parser now I start filing PRs on github. I try to explain the issues in PR and how I fixed them. Somehow, I still need an issue system to track bugs and tasks. Should I file them in JBS? I'm concerned that I abuse the JBS. I don't know what maintainer's take on such experimental feature. 3. doc For people who are interested in reviewing what we have done, I prepare a document of details. This is the continuation of our high-level design. https://gist.github.com/navyxliu/d1994ae68a999a70300d6ea7096a2b97#file-c2_pea_details-md thanks, --lx From kbarrett at openjdk.org Fri Mar 3 20:30:12 2023 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 3 Mar 2023 20:30:12 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken In-Reply-To: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: <_z1b8GbcB-3lzR_X34030jptgO0nfvg3GkCiIFcKK18=.aa4f6074-1866-4f70-a9a7-e1266aa14303@github.com> On Thu, 2 Mar 2023 10:06:17 GMT, Amit Kumar wrote: > This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. src/hotspot/cpu/aarch64/c1_Defs_aarch64.hpp line 93: > 91: enum { > 92: pd_reserved_argument_area_size_factor = 2 > 93: }; [Not a review, just a drive-by comment.] 
Please don't use enums to define constants like this. This is an old and obsolete style that shouldn't be used any more. Quoting from the HotSpot Style Guide: "Due to bugs in certain (very old) compilers, there is widespread use of enums and avoidance of in-class initialization of static integral constant members. Compilers having such bugs are no longer supported. Except where an enum is semantically appropriate, new code should use integral constants." ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Sat Mar 4 01:21:11 2023 From: duke at openjdk.org (SUN Guoyun) Date: Sat, 4 Mar 2023 01:21:11 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken In-Reply-To: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Thu, 2 Mar 2023 10:06:17 GMT, Amit Kumar wrote: > This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. src/hotspot/cpu/s390/c1_Defs_s390.hpp line 77: > 75: // the number of stack required by ArrayCopyStub > 76: enum { > 77: pd_arraycopystub_reserved_argument_area_size = 4 Do you know the root cause of failure? Maybe only the modification here are needed. You can try. ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Sat Mar 4 01:30:13 2023 From: duke at openjdk.org (SUN Guoyun) Date: Sat, 4 Mar 2023 01:30:13 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken In-Reply-To: <_z1b8GbcB-3lzR_X34030jptgO0nfvg3GkCiIFcKK18=.aa4f6074-1866-4f70-a9a7-e1266aa14303@github.com> References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> <_z1b8GbcB-3lzR_X34030jptgO0nfvg3GkCiIFcKK18=.aa4f6074-1866-4f70-a9a7-e1266aa14303@github.com> Message-ID: On Fri, 3 Mar 2023 20:27:14 GMT, Kim Barrett wrote: >> This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. > > src/hotspot/cpu/aarch64/c1_Defs_aarch64.hpp line 93: > >> 91: enum { >> 92: pd_reserved_argument_area_size_factor = 2 >> 93: }; > > [Not a review, just a drive-by comment.] > Please don't use enums to define constants like this. This is an old and obsolete style that shouldn't be > used any more. Quoting from the HotSpot Style Guide: > "Due to bugs in certain (very old) compilers, there is widespread use of enums and avoidance of in-class initialization of static integral constant members. Compilers having such bugs are no longer supported. Except where an enum is semantically appropriate, new code should use integral constants." @kimbarrett Where to see `HotSpot Style Guide`, can you give a URL link? ------------- PR: https://git.openjdk.org/jdk/pull/12825 From kvn at openjdk.org Sat Mar 4 02:00:11 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 4 Mar 2023 02:00:11 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 16:16:08 GMT, Vladimir Kozlov wrote: > Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. > We have *product* VM flags for most intrinsics and set them in VM based on HW support. > But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. 
> Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. > > I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). > > Additional fixes: > Fixed Interpreter to skip intrinsics if they are disabled with flag. > Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. > Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. > Added missing `native` mark to `_currentThread`. > Removed unused `AbstractInterpreter::in_native_entry()`. > Cleanup C2 intrinsic checks code. > > Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. GHA failure on linux-x86 in test compiler/vectorization/runner/LoopRangeStrideTest.java is due to [JDK-8303105](https://bugs.openjdk.org/browse/JDK-8303105) ------------- PR: https://git.openjdk.org/jdk/pull/12858 From kvn at openjdk.org Sat Mar 4 02:13:10 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 4 Mar 2023 02:13:10 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 16:16:08 GMT, Vladimir Kozlov wrote: > Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. > We have *product* VM flags for most intrinsics and set them in VM based on HW support. > But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. > Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. > > I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). > > Additional fixes: > Fixed Interpreter to skip intrinsics if they are disabled with flag. > Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. > Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. > Added missing `native` mark to `_currentThread`. > Removed unused `AbstractInterpreter::in_native_entry()`. > Cleanup C2 intrinsic checks code. > > Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. GHA failure on linux-x86 in test java/foreign/callarranger/TestMacOsAArch64CallArranger.java most likely due to [JDK-8303516](https://bugs.openjdk.org/browse/JDK-8303516). I did not have the fix when submitted PR. 
------------- PR: https://git.openjdk.org/jdk/pull/12858 From amitkumar at openjdk.org Sat Mar 4 04:48:11 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Sat, 4 Mar 2023 04:48:11 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> <_z1b8GbcB-3lzR_X34030jptgO0nfvg3GkCiIFcKK18=.aa4f6074-1866-4f70-a9a7-e1266aa14303@github.com> Message-ID: <9Vy53R1wVzDdnedV1XQM7x3ZXEL7-VRaMCC-6u7HFsE=.62031107-d4bc-44b9-8608-8477da8b987e@github.com> On Sat, 4 Mar 2023 01:27:36 GMT, SUN Guoyun wrote: >> src/hotspot/cpu/aarch64/c1_Defs_aarch64.hpp line 93: >> >>> 91: enum { >>> 92: pd_reserved_argument_area_size_factor = 2 >>> 93: }; >> >> [Not a review, just a drive-by comment.] >> Please don't use enums to define constants like this. This is an old and obsolete style that shouldn't be >> used any more. Quoting from the HotSpot Style Guide: >> "Due to bugs in certain (very old) compilers, there is widespread use of enums and avoidance of in-class initialization of static integral constant members. Compilers having such bugs are no longer supported. Except where an enum is semantically appropriate, new code should use integral constants." > > @kimbarrett Where to see `HotSpot Style Guide`, can you give a URL link? @sunny868 here: https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Mon Mar 6 02:49:00 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 02:49:00 GMT Subject: RFR: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 Message-ID: This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 LoongArch64 architecture.

# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
# assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
#
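For context, the invariant this assert enforces is that the block defining each input of a node must dominate the block the node was scheduled into, with Root/Region/Phi/MachMerge exempt because they merge values along control edges. A restatement of the check, illustrative only and mirroring the assert above:

// Illustrative restatement of the PhaseCFG verification check: for a node n
// placed in 'block', the block holding each of its inputs must dominate
// 'block', unless n merges values along incoming control edges.
static void check_use_dominated_by_def(Block* block, Node* n, Block* def_block) {
  assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() ||
         def_block->dominates(block), "uses must be dominated by definitions");
}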
This PR fix the issue, Please help review it. Thanks. ------------- Commit messages: - 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 Changes: https://git.openjdk.org/jdk/pull/12874/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12874&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303627 Stats: 17 lines in 2 files changed: 14 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/12874.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12874/head:pull/12874 PR: https://git.openjdk.org/jdk/pull/12874 From amitkumar at openjdk.org Mon Mar 6 03:45:02 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 6 Mar 2023 03:45:02 GMT Subject: RFR: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 In-Reply-To: References: Message-ID: <6ika2MpzCLoxWxpvPuiqX4Ukh8XxueLXCB6xy9Qbdec=.6eec5a9d-8721-4eed-9eee-6705e89863cc@github.com> On Mon, 6 Mar 2023 02:42:21 GMT, SUN Guoyun wrote: > This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 LoongArch64 architecture. > >

> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
> # assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
> #
> 
> This PR fix the issue, Please help review it. > > Thanks. [*Not a Review*] This PR fixes test failure on s390x as well [JDK-8303496](https://bugs.openjdk.org/browse/JDK-8303496). Which was initially expected to be fix by [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Thanks @sunny868 ------------- Marked as reviewed by amitkumar (Author). PR: https://git.openjdk.org/jdk/pull/12874 From amitkumar at openjdk.org Mon Mar 6 04:47:11 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 6 Mar 2023 04:47:11 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Sat, 4 Mar 2023 01:17:30 GMT, SUN Guoyun wrote: >> This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. > > src/hotspot/cpu/s390/c1_Defs_s390.hpp line 77: > >> 75: // the number of stack required by ArrayCopyStub >> 76: enum { >> 77: pd_arraycopystub_reserved_argument_area_size = 4 > > Do you know the root cause of failure? Maybe only the modification here are needed. You can try. this isn't fixing the issue. Beside that I had to make changes in `src/hotspot/share/c1/c1_Compilation.cpp` because `compiler/c1/KlassAccessCheckTest.java` appeared after build fix and it was failing with the same error as build. ------------- PR: https://git.openjdk.org/jdk/pull/12825 From jbhateja at openjdk.org Mon Mar 6 04:51:09 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 6 Mar 2023 04:51:09 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs In-Reply-To: <86PauwnM1BpvfJVJW1kyP6g3sVeirHKrMIry-kl6Ins=.50f25d11-47d2-4b33-8008-7f36a874b06e@github.com> References: <86PauwnM1BpvfJVJW1kyP6g3sVeirHKrMIry-kl6Ins=.50f25d11-47d2-4b33-8008-7f36a874b06e@github.com> Message-ID: On Mon, 13 Feb 2023 01:43:49 GMT, Fei Gao wrote: >>> @fg1417 Do you have the possibility to test on arm32? >> >> Sure. I'll do the testing with a 32-bit docker container on a 64-bit host. > >> > @fg1417 Do you have the possibility to test on arm32? >> >> Sure. I'll do the testing with a 32-bit docker container on a 64-bit host. > > The testing for tier 1 - 3 and jcstress looks good. No new failures on arm32. Thanks. > In a follow-up RFE, the people who care about `+AlignVector` (eg. @fg1417 ) can improve the alignment requirements and analysis **[1]**, so that the performance can be restored, or even improved compared to what master does now. > > What are your thoughts on this? @fg1417 @jatin-bhateja @vnkozlov @TobiHartmann ? > I did some additional testing and found that following case which does not carry any true dependency no longer vectorizes with default options (-AlignVector). 
public class Test { static int N = 512; public static void main(String[] strArr) { double[] data1 = new double[N]; short[] data2 = new short[N]; init(data1, data2); for (int i = 0; i < 10_000; i++){ test(data1, data2); } } static void test(double[] data1, short[] data2) { for (int i = 16; i < N-16; i++) { short v = data2[i + 2]; data2[i] = v; data1[i] = (double)v; } } static void init(double[] data1, short[] data2) { for (int j = 0; j < N; j++) { data1[j] = (double)j; data2[j] = (short)j; } } } With -XX:-AlignVector and Vectorize,true pragma it does vectorize, since short stores were seen as mis-aligned w.r.t to alignment enforced by ShortD nodes which were identified as best memory reference to algin with on account of being most well connected nodes, In strict sense new behavior looks correct, even bypassing mis-aligned memory access during extend_packlist. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From jbhateja at openjdk.org Mon Mar 6 05:16:13 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 6 Mar 2023 05:16:13 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: On Thu, 2 Mar 2023 15:56:00 GMT, Emanuel Peter wrote: >> @jatin-bhateja I think the IR rule is just ineffective. I have the following condition in it that will never be met: >> `applyIfAnd = {"MaxVectorSize", ">= 8", "MaxVectorSize", "<= 4"},` >> The `<= 4` must hold so that `byte_offset <= MaxVectorSize`, and so the cyclical dependency would not happen. But `>= 8` must hold so that two ints fit in a vector, so that we even vectorize. >> >> I could improve the script and filter out such ineffective IR rules. Not sure if that is worth it though. > > I fixed my script, it should now compute the ranges correctly, and not add IR rules with impossible ranges. Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. 
@Test // CPU: sse4.1 to avx -> vector_width: 16 -> elements in vector: 4 // positive byte_offset 12 can lead to cyclic dependency @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "false"}) // CPU: avx2 -> vector_width: 32 -> elements in vector: 8 // positive byte_offset 12 can lead to cyclic dependency @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, applyIfCPUFeatureAnd = {"avx2", "true", "avx512", "false"}) // CPU: avx512 -> vector_width: 64 -> elements in vector: 16 // positive byte_offset 12 can lead to cyclic dependency @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, applyIfCPUFeature = {"avx512", "true"}) // CPU: asimd -> vector_width: 32 -> elements in vector: 8 // positive byte_offset 12 can lead to cyclic dependency @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, applyIfCPUFeature = {"asimd", "true"}) public static void testIntP3(int[] data) { for (int j = 0; j < RANGE - 3; j++) { data[j + 3] = (int)(data[j] * (int)-11); } } Also SLP now operates under SuperWordMaxVectorSize so it will be good to its it instead. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From jbhateja at openjdk.org Mon Mar 6 05:24:19 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 6 Mar 2023 05:24:19 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: On Mon, 6 Mar 2023 05:13:27 GMT, Jatin Bhateja wrote: >> I fixed my script, it should now compute the ranges correctly, and not add IR rules with impossible ranges. > > Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. 
> > > @Test > // CPU: sse4.1 to avx -> vector_width: 16 -> elements in vector: 4 > // positive byte_offset 12 can lead to cyclic dependency > @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, > applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, > applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "false"}) > // CPU: avx2 -> vector_width: 32 -> elements in vector: 8 > // positive byte_offset 12 can lead to cyclic dependency > @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, > applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, > applyIfCPUFeatureAnd = {"avx2", "true", "avx512", "false"}) > // CPU: avx512 -> vector_width: 64 -> elements in vector: 16 > // positive byte_offset 12 can lead to cyclic dependency > @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, > applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, > applyIfCPUFeature = {"avx512", "true"}) > // CPU: asimd -> vector_width: 32 -> elements in vector: 8 > // positive byte_offset 12 can lead to cyclic dependency > @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, > applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, > applyIfCPUFeature = {"asimd", "true"}) > public static void testIntP3(int[] data) { > for (int j = 0; j < RANGE - 3; j++) { > data[j + 3] = (int)(data[j] * (int)-11); > } > } > > > Also SLP now operates under SuperWordMaxVectorSize so it will be good to its it instead. With +AlignVector behavior with and without Vectorize,true pragma should match. static void test1() { for (int i = 4; i < 100; i++) { fArr[i + 4] = fArr[i]; } } CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -cp . bug WARNING: Using incubator modules: jdk.incubator.vector res = 0.0 CPROMPT> CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -XX:CompileCommand=Vectorize,bug::test1,true -cp . bug CompileCommand: Vectorize bug.test1 bool Vectorize = true WARNING: Using incubator modules: jdk.incubator.vector new Vector node: 990 LoadVector === 373 856 824 [[ 822 802 800 798 718 706 556 196 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched #vectory[8]:{float} !orig=[823],[719],[557],[199],143 !jvms: bug::test1 @ bci:18 (line 7) new Vector node: 991 StoreVector === 855 856 825 990 [[ 988 195 856 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched Memory: @float[int:>=0]:NotNull:exact+any *, idx=6; !orig=[822],[718],[556],[196],164 !jvms: bug::test1 @ bci:19 (line 7) res = 0.0 ------------- PR: https://git.openjdk.org/jdk/pull/12350 From amitkumar at openjdk.org Mon Mar 6 05:58:23 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 6 Mar 2023 05:58:23 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: > This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. 
Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: use constant instead of enum ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12825/files - new: https://git.openjdk.org/jdk/pull/12825/files/bb0ae5d8..820e2884 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12825&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12825&range=00-01 Stats: 21 lines in 7 files changed: 0 ins; 14 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/12825.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12825/head:pull/12825 PR: https://git.openjdk.org/jdk/pull/12825 From amitkumar at openjdk.org Mon Mar 6 06:00:14 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 6 Mar 2023 06:00:14 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: <_z1b8GbcB-3lzR_X34030jptgO0nfvg3GkCiIFcKK18=.aa4f6074-1866-4f70-a9a7-e1266aa14303@github.com> References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> <_z1b8GbcB-3lzR_X34030jptgO0nfvg3GkCiIFcKK18=.aa4f6074-1866-4f70-a9a7-e1266aa14303@github.com> Message-ID: On Fri, 3 Mar 2023 20:27:14 GMT, Kim Barrett wrote: >> Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: >> >> use constant instead of enum > > src/hotspot/cpu/aarch64/c1_Defs_aarch64.hpp line 93: > >> 91: enum { >> 92: pd_reserved_argument_area_size_factor = 2 >> 93: }; > > [Not a review, just a drive-by comment.] > Please don't use enums to define constants like this. This is an old and obsolete style that shouldn't be > used any more. Quoting from the HotSpot Style Guide: > "Due to bugs in certain (very old) compilers, there is widespread use of enums and avoidance of in-class initialization of static integral constant members. Compilers having such bugs are no longer supported. Except where an enum is semantically appropriate, new code should use integral constants." @kimbarrett I've made changes, please take a look now. Thank you. ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Mon Mar 6 06:24:26 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 06:24:26 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 04:44:04 GMT, Amit Kumar wrote: >> src/hotspot/cpu/s390/c1_Defs_s390.hpp line 77: >> >>> 75: // the number of stack required by ArrayCopyStub >>> 76: enum { >>> 77: pd_arraycopystub_reserved_argument_area_size = 4 >> >> Do you know the root cause of failure? Maybe only the modification here are needed. You can try. > > this isn't fixing the issue. Beside that I had to make changes in `src/hotspot/share/c1/c1_Compilation.cpp` because `compiler/c1/KlassAccessCheckTest.java` appeared after build fix and it was failing with the same error as build. Can you upload hs_err_xxx.log file information? ------------- PR: https://git.openjdk.org/jdk/pull/12825 From amitkumar at openjdk.org Mon Mar 6 06:33:13 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 6 Mar 2023 06:33:13 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 06:21:15 GMT, SUN Guoyun wrote: >> this isn't fixing the issue. 
Beside that I had to make changes in `src/hotspot/share/c1/c1_Compilation.cpp` because `compiler/c1/KlassAccessCheckTest.java` appeared after build fix and it was failing with the same error as build. > > Can you upload hs_err_xxx.log file information? Sure; [KlassAccessCheckTest.log](https://github.com/openjdk/jdk/files/10895005/KlassAccessCheckTest.log) ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Mon Mar 6 06:42:14 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 06:42:14 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 06:30:25 GMT, Amit Kumar wrote: >> Can you upload hs_err_xxx.log file information? > > Sure; [KlassAccessCheckTest.log](https://github.com/openjdk/jdk/files/10895005/KlassAccessCheckTest.log) I want to see /home/amit/build_test/jdk/build/linux-s390x-server-fastdebug/test-support/jtreg_test_hotspot_jtreg_tier1/scratch/1/hs_err_pid901198.log ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Mon Mar 6 06:42:14 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 06:42:14 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 06:37:26 GMT, SUN Guoyun wrote: >> Sure; [KlassAccessCheckTest.log](https://github.com/openjdk/jdk/files/10895005/KlassAccessCheckTest.log) > > I want to see /home/amit/build_test/jdk/build/linux-s390x-server-fastdebug/test-support/jtreg_test_hotspot_jtreg_tier1/scratch/1/hs_err_pid901198.log I would like to see more detailed stack information from hs_err_pid901198.log, so to know which stub is causing the problem. ------------- PR: https://git.openjdk.org/jdk/pull/12825 From amitkumar at openjdk.org Mon Mar 6 06:49:13 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 6 Mar 2023 06:49:13 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 06:39:24 GMT, SUN Guoyun wrote: >> I want to see /home/amit/build_test/jdk/build/linux-s390x-server-fastdebug/test-support/jtreg_test_hotspot_jtreg_tier1/scratch/1/hs_err_pid901198.log > > I would like to see more detailed stack information from hs_err_pid901198.log, so to know which stub is causing the problem. I'm really sorry. [hs_err_pid901198.log](https://github.com/openjdk/jdk/files/10895097/hs_err_pid901198.log) ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Mon Mar 6 06:59:04 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 06:59:04 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. 
---------------------------------------------------------------------- On Mon, 6 Mar 2023 06:45:52 GMT, Amit Kumar wrote: >> I would like to see more detailed stack information from hs_err_pid901198.log, so to know which stub is causing the problem. > > I'm really sorry. > [hs_err_pid901198.log](https://github.com/openjdk/jdk/files/10895097/hs_err_pid901198.log) Thank you. perhaps this is the difference between the S390 and other architectures, do not call store_parameter() in this method on other architectures, such as x86. LIR_Assembler::emit_typecheck_helper(LIR_OpTypeCheck*, Label*, Label*, Label*)+0x800 (c1_LIRAssembler_s390.cpp:2534) Next, let's see if there is a better solution. ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Mon Mar 6 07:23:05 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 07:23:05 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 06:56:44 GMT, SUN Guoyun wrote: >> I'm really sorry. >> [hs_err_pid901198.log](https://github.com/openjdk/jdk/files/10895097/hs_err_pid901198.log) > > Thank you. perhaps this is the difference between the S390 and other architectures, do not call store_parameter() in this method on other architectures, such as x86. > > LIR_Assembler::emit_typecheck_helper(LIR_OpTypeCheck*, Label*, Label*, Label*)+0x800 (c1_LIRAssembler_s390.cpp:2534) > > Next, let's see if there is a better solution. https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/s390/c1_LIRAssembler_s390.cpp#L2534 Maybe you can try store klass_RInfo & k_RInfo to current sp location but not reserve_stack space. such as: __ z_stg(klass_RInfo, offset_in_bytes, Z_SP); ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Mon Mar 6 07:55:10 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 07:55:10 GMT Subject: RFR: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 02:42:21 GMT, SUN Guoyun wrote: > This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 LoongArch64 architecture. > >

> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
> # assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
> #
> 
> This PR fixes the issue. Please help review it. > > Thanks. The test `test/hotspot/jtreg/compiler/vectorization/runner/LoopRangeStrideTest.java` failure is not related to this patch; it also failed in #12873. ------------- PR: https://git.openjdk.org/jdk/pull/12874 From duke at openjdk.org Mon Mar 6 08:08:17 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 08:08:17 GMT Subject: RFR: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 02:42:21 GMT, SUN Guoyun wrote: > This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 and LoongArch64 architectures. > >

> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
> # assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
> #
> 
> This PR fix the issue, Please help review it. > > Thanks. > [_Not a Review_] This PR fixes test failure on s390x as well [JDK-8303496](https://bugs.openjdk.org/browse/JDK-8303496). Which was initially expected to be fix by [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). > Thanks @offamitkumar for the reminding me. @chhagedorn, whether this patch duplicates with JDK-8288981? ------------- PR: https://git.openjdk.org/jdk/pull/12874 From amitkumar at openjdk.org Mon Mar 6 08:31:04 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 6 Mar 2023 08:31:04 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 07:20:37 GMT, SUN Guoyun wrote: >> Thank you. perhaps this is the difference between the S390 and other architectures, do not call store_parameter() in this method on other architectures, such as x86. >> >> LIR_Assembler::emit_typecheck_helper(LIR_OpTypeCheck*, Label*, Label*, Label*)+0x800 (c1_LIRAssembler_s390.cpp:2534) >> >> Next, let's see if there is a better solution. > > https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/s390/c1_LIRAssembler_s390.cpp#L2534 > Maybe you can try store klass_RInfo & k_RInfo to current sp location but not reserve_stack space. such as: > __ z_stg(klass_RInfo, offset_in_bytes, Z_SP); But there are other call sites as well, what about them ? You mean to change them all ? ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Mon Mar 6 08:37:15 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 08:37:15 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: <1qEPj_24jMagvij94wWZoE2P6HAGOtGDZe9aCMxsO8w=.657268cc-4137-4ef9-80a0-c748c27ac678@github.com> On Mon, 6 Mar 2023 08:28:36 GMT, Amit Kumar wrote: >> https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/s390/c1_LIRAssembler_s390.cpp#L2534 >> Maybe you can try store klass_RInfo & k_RInfo to current sp location but not reserve_stack space. such as: >> __ z_stg(klass_RInfo, offset_in_bytes, Z_SP); > > But there are other call sites as well, what about them ? You mean to change them all ? I don't know much about the implementation of `emit_typecheck_helper` yet, maybe you can refer to aarch64. ------------- PR: https://git.openjdk.org/jdk/pull/12825 From epeter at openjdk.org Mon Mar 6 08:42:20 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 6 Mar 2023 08:42:20 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v4] In-Reply-To: References: Message-ID: <1m2NKVJjltewcKmzGmLv3Yhf5YERTRHHO1LkrnCErqU=.b401d791-accd-4a89-b1e7-02b9ae681fe8@github.com> On Fri, 3 Mar 2023 15:37:37 GMT, Roland Westrelin wrote: >> The loop that doesn't vectorize is: >> >> >> public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { >> for (int i = start; i < stop; i++) { >> UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); >> } >> } >> >> >> It's from a micro-benchmark in the panama >> repo. 
`SuperWord::find_adjacent_refs() `prevents it from vectorizing >> because it finds it cannot properly align the loop and, from the >> comment in the code, that: >> >> >> // Can't allow vectorization of unaligned memory accesses with the >> // same type since it could be overlapped accesses to the same array. >> >> >> The test for "same type" is implemented by looking at the memory >> operation type which in this case is overly conservative as the loop >> above is reading and writing with long loads/stores but from and to >> arrays of different types that can't overlap. Actually, with such >> mismatched accesses, it's also likely an incorrect test (reading and >> writing could be to the same array with loads/stores that use >> different operand size) eventhough I couldn't write a test case that >> would trigger an incorrect execution. >> >> As a fix, I propose implementing the "same type" test by looking at >> memory aliases instead. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > more tests test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMismatchedAccess.java line 185: > 183: baseOffset = 1; > 184: testByteByte5(byteArray, byteArray, 0, size-1); > 185: } Suggestion: } test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMismatchedAccess.java line 186: > 184: testByteByte5(byteArray, byteArray, 0, size-1); > 185: } > 186: Suggestion: ------------- PR: https://git.openjdk.org/jdk/pull/12440 From epeter at openjdk.org Mon Mar 6 08:42:21 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 6 Mar 2023 08:42:21 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v4] In-Reply-To: <1m2NKVJjltewcKmzGmLv3Yhf5YERTRHHO1LkrnCErqU=.b401d791-accd-4a89-b1e7-02b9ae681fe8@github.com> References: <1m2NKVJjltewcKmzGmLv3Yhf5YERTRHHO1LkrnCErqU=.b401d791-accd-4a89-b1e7-02b9ae681fe8@github.com> Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Mon, 6 Mar 2023 08:39:24 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> more tests > > test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMismatchedAccess.java line 186: > >> 184: testByteByte5(byteArray, byteArray, 0, size-1); >> 185: } >> 186: > > Suggestion: Remove the trailing spaces ------------- PR: https://git.openjdk.org/jdk/pull/12440 From duke at openjdk.org Mon Mar 6 08:46:16 2023 From: duke at openjdk.org (SUN Guoyun) Date: Mon, 6 Mar 2023 08:46:16 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: <1qEPj_24jMagvij94wWZoE2P6HAGOtGDZe9aCMxsO8w=.657268cc-4137-4ef9-80a0-c748c27ac678@github.com> References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> <1qEPj_24jMagvij94wWZoE2P6HAGOtGDZe9aCMxsO8w=.657268cc-4137-4ef9-80a0-c748c27ac678@github.com> Message-ID: On Mon, 6 Mar 2023 08:34:05 GMT, SUN Guoyun wrote: >> But there are other call sites as well, what about them ? You mean to change them all ? > > I don't know much about the implementation of `emit_typecheck_helper` yet, maybe you can refer to aarch64. 
According to my understanding, C1 stack space is divided into two parts: general stack space ( see hir()->max_stack) and reserved stack space (see reserved_argument_area, for stub used). `emit_typecheck_helper` is not stub, so shouldn't use the reserved stack. ------------- PR: https://git.openjdk.org/jdk/pull/12825 From epeter at openjdk.org Mon Mar 6 08:46:21 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 6 Mar 2023 08:46:21 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v4] In-Reply-To: References: Message-ID: <3ERB_bka5zNhZE0PYGLP-cD5b4SjrWM9wHU0LgnrO6A=.5df6e61d-c6d0-4a8d-a578-f023f1246723@github.com> On Fri, 3 Mar 2023 15:37:37 GMT, Roland Westrelin wrote: >> The loop that doesn't vectorize is: >> >> >> public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { >> for (int i = start; i < stop; i++) { >> UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); >> } >> } >> >> >> It's from a micro-benchmark in the panama >> repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing >> because it finds it cannot properly align the loop and, from the >> comment in the code, that: >> >> >> // Can't allow vectorization of unaligned memory accesses with the >> // same type since it could be overlapped accesses to the same array. >> >> >> The test for "same type" is implemented by looking at the memory >> operation type which in this case is overly conservative as the loop >> above is reading and writing with long loads/stores but from and to >> arrays of different types that can't overlap. Actually, with such >> mismatched accesses, it's also likely an incorrect test (reading and >> writing could be to the same array with loads/stores that use >> different operand size) eventhough I couldn't write a test case that >> would trigger an incorrect execution. >> >> As a fix, I propose implementing the "same type" test by looking at >> memory aliases instead. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > more tests Testing passes for change 10. Looks good now. ------------- Marked as reviewed by epeter (Committer). PR: https://git.openjdk.org/jdk/pull/12440 From epeter at openjdk.org Mon Mar 6 08:57:14 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 6 Mar 2023 08:57:14 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: On Mon, 6 Mar 2023 05:19:37 GMT, Jatin Bhateja wrote: >> Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. 
>> >> >> @Test >> // CPU: sse4.1 to avx -> vector_width: 16 -> elements in vector: 4 >> // positive byte_offset 12 can lead to cyclic dependency >> @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, >> applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, >> applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "false"}) >> // CPU: avx2 -> vector_width: 32 -> elements in vector: 8 >> // positive byte_offset 12 can lead to cyclic dependency >> @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, >> applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, >> applyIfCPUFeatureAnd = {"avx2", "true", "avx512", "false"}) >> // CPU: avx512 -> vector_width: 64 -> elements in vector: 16 >> // positive byte_offset 12 can lead to cyclic dependency >> @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, >> applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, >> applyIfCPUFeature = {"avx512", "true"}) >> // CPU: asimd -> vector_width: 32 -> elements in vector: 8 >> // positive byte_offset 12 can lead to cyclic dependency >> @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, >> applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, >> applyIfCPUFeature = {"asimd", "true"}) >> public static void testIntP3(int[] data) { >> for (int j = 0; j < RANGE - 3; j++) { >> data[j + 3] = (int)(data[j] * (int)-11); >> } >> } >> >> >> Also SLP now operates under SuperWordMaxVectorSize so it will be good to its it instead. > > With +AlignVector behavior with and without Vectorize,true pragma should match. > > > static void test1() { > for (int i = 4; i < 100; i++) { > fArr[i + 4] = fArr[i]; > } > } > > > > CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -cp . bug > WARNING: Using incubator modules: jdk.incubator.vector > res = 0.0 > CPROMPT> > CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -XX:CompileCommand=Vectorize,bug::test1,true -cp . bug > CompileCommand: Vectorize bug.test1 bool Vectorize = true > WARNING: Using incubator modules: jdk.incubator.vector > new Vector node: 990 LoadVector === 373 856 824 [[ 822 802 800 798 718 706 556 196 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched #vectory[8]:{float} !orig=[823],[719],[557],[199],143 !jvms: bug::test1 @ bci:18 (line 7) > new Vector node: 991 StoreVector === 855 856 825 990 [[ 988 195 856 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched Memory: @float[int:>=0]:NotNull:exact+any *, idx=6; !orig=[822],[718],[556],[196],164 !jvms: bug::test1 @ bci:19 (line 7) > res = 0.0 @jatin-bhateja Under `aarch64`, I have made bad experiences with `SuperWordMaxVectorSize`. It is not properly adjusted to be at most `MaxVectorSize`. For example if the `aarch64` machine only supports `MaxVectorSize <= 32`, but I set `SuperWordMaxVectorSize = 64`, then it will keep it at `64`. So then my IR rules fail. For the `x86 / x64` machines we have: https://github.com/openjdk/jdk/blob/33bec207103acd520eb99afb093cfafa44aecfda/src/hotspot/cpu/x86/vm_version_x86.cpp#L1314-L1333 @fg1417 Would you like to implement this for `aarch64`? 
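(As an aside, the expected relationship between the two flags can be spelled out in a small check like the sketch below. This is only an illustration of the clamping that the x86 code linked above performs; the test class, the `getVMFlag` accessor and the scaffolding are my assumptions, not code from any of these PRs.)

import jdk.test.whitebox.WhiteBox;

// Illustrative sketch: assert that SuperWordMaxVectorSize never exceeds MaxVectorSize.
// Meant to run as a jtreg test with WhiteBox enabled
// (-Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI).
public class SuperWordMaxVectorSizeClampCheck {
    public static void main(String[] args) {
        WhiteBox wb = WhiteBox.getWhiteBox();
        long maxVectorSize = (Long) wb.getVMFlag("MaxVectorSize");
        long superWordMaxVectorSize = (Long) wb.getVMFlag("SuperWordMaxVectorSize");
        if (superWordMaxVectorSize > maxVectorSize) {
            throw new RuntimeException("SuperWordMaxVectorSize (" + superWordMaxVectorSize +
                                       ") is not clamped to MaxVectorSize (" + maxVectorSize + ")");
        }
    }
}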
------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Mon Mar 6 09:05:18 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 6 Mar 2023 09:05:18 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: On Mon, 6 Mar 2023 05:19:37 GMT, Jatin Bhateja wrote: >> Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. >> >> >> @Test >> // CPU: sse4.1 to avx -> vector_width: 16 -> elements in vector: 4 >> // positive byte_offset 12 can lead to cyclic dependency >> @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, >> applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, >> applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "false"}) >> // CPU: avx2 -> vector_width: 32 -> elements in vector: 8 >> // positive byte_offset 12 can lead to cyclic dependency >> @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, >> applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, >> applyIfCPUFeatureAnd = {"avx2", "true", "avx512", "false"}) >> // CPU: avx512 -> vector_width: 64 -> elements in vector: 16 >> // positive byte_offset 12 can lead to cyclic dependency >> @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, >> applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, >> applyIfCPUFeature = {"avx512", "true"}) >> // CPU: asimd -> vector_width: 32 -> elements in vector: 8 >> // positive byte_offset 12 can lead to cyclic dependency >> @IR(counts = {IRNode.LOAD_VECTOR, "> 0", IRNode.MUL_V, "> 0", IRNode.STORE_VECTOR, "> 0"}, >> applyIfAnd = {"AlignVector", "false", "MaxVectorSize", ">= 8", "MaxVectorSize", "<= 12"}, >> applyIfCPUFeature = {"asimd", "true"}) >> public static void testIntP3(int[] data) { >> for (int j = 0; j < RANGE - 3; j++) { >> data[j + 3] = (int)(data[j] * (int)-11); >> } >> } >> >> >> Also SLP now operates under SuperWordMaxVectorSize so it will be good to its it instead. > > With +AlignVector behavior with and without Vectorize,true pragma should match. > > > static void test1() { > for (int i = 4; i < 100; i++) { > fArr[i + 4] = fArr[i]; > } > } > > > > CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -cp . bug > WARNING: Using incubator modules: jdk.incubator.vector > res = 0.0 > CPROMPT> > CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -XX:CompileCommand=Vectorize,bug::test1,true -cp . 
bug > CompileCommand: Vectorize bug.test1 bool Vectorize = true > WARNING: Using incubator modules: jdk.incubator.vector > new Vector node: 990 LoadVector === 373 856 824 [[ 822 802 800 798 718 706 556 196 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched #vectory[8]:{float} !orig=[823],[719],[557],[199],143 !jvms: bug::test1 @ bci:18 (line 7) > new Vector node: 991 StoreVector === 855 856 825 990 [[ 988 195 856 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched Memory: @float[int:>=0]:NotNull:exact+any *, idx=6; !orig=[822],[718],[556],[196],164 !jvms: bug::test1 @ bci:19 (line 7) > res = 0.0 @jatin-bhateja > Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. The cyclic dependency is at a distance of 3, not 4 in this example. Ints are 4 bytes. Thus, the `byte_offset` is 12 bytes. So if `MaxVectorSize <= 12`, we cannot ever have a cyclic dependency within a vector. See my explanations at the beginning of the test file, for example: https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L49-L57 ------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Mon Mar 6 09:05:19 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 6 Mar 2023 09:05:19 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: On Mon, 6 Mar 2023 08:59:30 GMT, Emanuel Peter wrote: >> With +AlignVector behavior with and without Vectorize,true pragma should match. >> >> >> static void test1() { >> for (int i = 4; i < 100; i++) { >> fArr[i + 4] = fArr[i]; >> } >> } >> >> >> >> CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -cp . bug >> WARNING: Using incubator modules: jdk.incubator.vector >> res = 0.0 >> CPROMPT> >> CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -XX:CompileCommand=Vectorize,bug::test1,true -cp . bug >> CompileCommand: Vectorize bug.test1 bool Vectorize = true >> WARNING: Using incubator modules: jdk.incubator.vector >> new Vector node: 990 LoadVector === 373 856 824 [[ 822 802 800 798 718 706 556 196 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched #vectory[8]:{float} !orig=[823],[719],[557],[199],143 !jvms: bug::test1 @ bci:18 (line 7) >> new Vector node: 991 StoreVector === 855 856 825 990 [[ 988 195 856 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched Memory: @float[int:>=0]:NotNull:exact+any *, idx=6; !orig=[822],[718],[556],[196],164 !jvms: bug::test1 @ bci:19 (line 7) >> res = 0.0 > > @jatin-bhateja >> Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. > > The cyclic dependency is at a distance of 3, not 4 in this example. Ints are 4 bytes. Thus, the `byte_offset` is 12 bytes. So if `MaxVectorSize <= 12`, we cannot ever have a cyclic dependency within a vector. 
> > See my explanations at the beginning of the test file, for example: > https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L49-L57 You can see that the rules for distance 4 are adjusted, if you look at `testIntP4`: https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L1202-L1230 ------------- PR: https://git.openjdk.org/jdk/pull/12350 From jbhateja at openjdk.org Mon Mar 6 10:05:25 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 6 Mar 2023 10:05:25 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: On Mon, 6 Mar 2023 09:01:49 GMT, Emanuel Peter wrote: >> @jatin-bhateja >>> Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. >> >> The cyclic dependency is at a distance of 3, not 4 in this example. Ints are 4 bytes. Thus, the `byte_offset` is 12 bytes. So if `MaxVectorSize <= 12`, we cannot ever have a cyclic dependency within a vector. >> >> See my explanations at the beginning of the test file, for example: >> https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L49-L57 > > You can see that the rules for distance 4 are adjusted, if you look at `testIntP4`: > https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L1202-L1230 Correct, my bad just pasted a wrong example!, Do you see any value in > @jatin-bhateja > > > Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. > > The cyclic dependency is at a distance of 3, not 4 in this example. Ints are 4 bytes. Thus, the `byte_offset` is 12 bytes. So if `MaxVectorSize <= 12`, we cannot ever have a cyclic dependency within a vector. > > See my explanations at the beginning of the test file, for example: > > https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L49-L57 My bad, just pasted a wrong example, do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two. Lets remove them to reduce noise ? ------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Mon Mar 6 10:24:28 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 6 Mar 2023 10:24:28 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: On Mon, 6 Mar 2023 10:01:47 GMT, Jatin Bhateja wrote: >> With +AlignVector behavior with and without Vectorize,true pragma should match. 
>> >> >> static void test1() { >> for (int i = 4; i < 100; i++) { >> fArr[i + 4] = fArr[i]; >> } >> } >> >> >> >> CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -cp . bug >> WARNING: Using incubator modules: jdk.incubator.vector >> res = 0.0 >> CPROMPT> >> CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -XX:CompileCommand=Vectorize,bug::test1,true -cp . bug >> CompileCommand: Vectorize bug.test1 bool Vectorize = true >> WARNING: Using incubator modules: jdk.incubator.vector >> new Vector node: 990 LoadVector === 373 856 824 [[ 822 802 800 798 718 706 556 196 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched #vectory[8]:{float} !orig=[823],[719],[557],[199],143 !jvms: bug::test1 @ bci:18 (line 7) >> new Vector node: 991 StoreVector === 855 856 825 990 [[ 988 195 856 ]] @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched Memory: @float[int:>=0]:NotNull:exact+any *, idx=6; !orig=[822],[718],[556],[196],164 !jvms: bug::test1 @ bci:19 (line 7) >> res = 0.0 > >> @jatin-bhateja >> >> > Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. >> >> The cyclic dependency is at a distance of 3, not 4 in this example. Ints are 4 bytes. Thus, the `byte_offset` is 12 bytes. So if `MaxVectorSize <= 12`, we cannot ever have a cyclic dependency within a vector. >> >> See my explanations at the beginning of the test file, for example: >> >> https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L49-L57 > > My bad, just pasted a wrong example, wanted to refer > > https://github.com/openjdk/jdk/pull/12350#discussion_r1125924921 > > do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two. Lets remove them to reduce noise ? > With +AlignVector behavior with and without Vectorize,true pragma should match. This was about example with `fArr[i + 4] = fArr[i];` in the loop. `byte_offset = 4 * 4 = 16`. @jatin-bhateja I am not sure what you are trying to say, what do you mean by `should match`? If you mean to say "should vectorize": I think it should **not** vectorize, and your output shows that there must be a bug (with master, before my fix): `LoadVector === ... #vectory[8]:{float}` You have a cyclic dependency with float-distance 4 (`byte_distance = 16`). But you have 8 floats in the vector. That will lead to wrong results. It should only vectorize if `MaxVectorSize <= 16`. See conditions for `testIntP4` which I quoted above. I made a full test with it, and pasted it below. I run it with these command lines: `./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=Vectorize,Test::test,true -XX:+TraceNewVectors -XX:+AlignVector Test.java` 1. On `master`, with `+AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts! 2. On `master`, with `-AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts! 3. On `master`, with `-AlignVector` and `Vectorize false`: Does not vectorize. Detects the cyclic dependency (`LoadF` and `StoreF` have `memory_alignment != 0`). 4. On `master`, with `+AlignVector` and `Vectorize false`: same as for 4. As you can see, here the flag `AlignVector` is not even relevant. Why do we get wrong results? We bypass the `memory_alignment == 0` check when we have `_do_vector_loop == true`. 
That bypasses the alignment analysis which is critical. Without it, we only ever check `independence` at distance 1 (for the pairs), and not for all elements in a vector! Relevant section on `master`: https://github.com/openjdk/jdk/blob/8f195ff236000d9c019f8beb2b13355083e211b5/src/hotspot/share/opto/superword.cpp#L646 With `my patch`: all of the command-lines from above will not vectorize. Except if you set `-XX:MaxVectorSize=16` or smaller, where the cyclic dependency cannot manifest within one vector. @jatin-bhateja does this answer your question? Or did I misunderstand your question? **PS**: I have found the "alignment analysis" and "independence" checks rather confusing. And obviously our code-base was changed without properly testing it, and I think also without properly understanding it. In Larsen's [paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf), on which the `SuperWord` implementation is based, they only ever explicitly test `independent(s1, s2)` for the elements of a pair. But in their definitions they define not just `pairs` to be `independent`, but also `packs`. But how do you get from `independence` of `pairs` to `independence` of `packs`? The best I could find was this sentence in the paper: Since the adjacent memory identification phase uses alignment information, it will never create pairs of memory accesses that cross an alignment boundary. It is not further described in the paper unfortunately. But the idea is that you have "alignment boundaries", and that pairs are not supposed to cross them. I think that is exactly why we require all `mem_ref`'s of the same type (memory slice) to be aligned (`memory_alignment == 0`). That ensures that no pairs cross the alignment boundary of any other `pack` of the same type (memory slice). But of course requiring this strict alignment is quite restrictive. So that is why the CompileCommand `Vectorize` was introduced. But it was never properly tested it seems. And it just trusts the programmer that there are no cyclic dependencies. That is why I now added the verification and filtering. It prevents vectorization when cyclic dependencies are detected by my new `SuperWord::find_dependence`.
public class Test { static int N = 100; public static void main(String[] strArr) { float[] gold = new float[N]; float[] data = new float[N]; init(gold); test(gold); for (int i = 0; i < 10_000; i++){ init(data); test(data); verify(data, gold); } System.out.println("success."); } static void test(float[] data) { for (int i = 0; i < N - 4; i++) { data[i + 4] = data[i]; } } static void init(float[] data) { for (int j = 0; j < N; j++) { data[j] = j; } } static void verify(float[] data, float[] gold) { for (int i = 0; i < N; i++) { if (data[i] != gold[i]) { throw new RuntimeException(" Invalid result: dataF[" + i + "]: " + data[i] + " != " + gold[i]); } } } } ------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Mon Mar 6 10:35:26 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 6 Mar 2023 10:35:26 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> Message-ID: <9tb5DWpbU4UzvvQXlD4rRFtNEgwXbcVRNYmDtGan_sI=.21ae9f61-4abc-4143-9bf3-dfcd11b42ce9@github.com> On Mon, 6 Mar 2023 10:21:42 GMT, Emanuel Peter wrote: >>> @jatin-bhateja >>> >>> > Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4. >>> >>> The cyclic dependency is at a distance of 3, not 4 in this example. Ints are 4 bytes. Thus, the `byte_offset` is 12 bytes. So if `MaxVectorSize <= 12`, we cannot ever have a cyclic dependency within a vector. >>> >>> See my explanations at the beginning of the test file, for example: >>> >>> https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L49-L57 >> >> My bad, just pasted a wrong example, wanted to refer >> >> https://github.com/openjdk/jdk/pull/12350#discussion_r1125924921 >> >> do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two. Lets remove them to reduce noise ? > >> With +AlignVector behavior with and without Vectorize,true pragma should match. > > This was about example with `fArr[i + 4] = fArr[i];` in the loop. `byte_offset = 4 * 4 = 16`. > > @jatin-bhateja I am not sure what you are trying to say, what do you mean by `should match`? > > If you mean to say "should vectorize": I think it should **not** vectorize, and your output shows that there must be a bug (with master, before my fix): > `LoadVector === ... #vectory[8]:{float}` > You have a cyclic dependency with float-distance 4 (`byte_distance = 16`). But you have 8 floats in the vector. That will lead to wrong results. It should only vectorize if `MaxVectorSize <= 16`. See conditions for `testIntP4` which I quoted above. > > I made a full test with it, and pasted it below. > I run it with these command lines: > > `./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=Vectorize,Test::test,true -XX:+TraceNewVectors -XX:+AlignVector Test.java` > > 1. On `master`, with `+AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts! > 2. On `master`, with `-AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts! > 3. On `master`, with `-AlignVector` and `Vectorize false`: Does not vectorize. 
Detects the cyclic dependency (`LoadF` and `StoreF` have `memory_alignment != 0`). > 4. On `master`, with `+AlignVector` and `Vectorize false`: same as for 4. > > As you can see, here the flag `AlignVector` is not even relevant. > > Why do we get wrong results? We bypass the `memory_alignment == 0` check when we have `_do_vector_loop == true`. That bypasses the alignment analysis which is critical. Without it, we only ever check `independence` at distance 1 (for the pairs), and not for all elements in a vector! Relevant section on `master`: > https://github.com/openjdk/jdk/blob/8f195ff236000d9c019f8beb2b13355083e211b5/src/hotspot/share/opto/superword.cpp#L646 > > With `my patch`: all of the command-lines from above will not vectorize. Except if you set `-XX:MaxVectorSize=16` or smaller, where the cyclic dependency cannot manifest within one vector. > > @jatin-bhateja does this answer you question? Or did I misunderstand your question? > > **PS**: I have found the "alignment analysis" and "independence" checks rather confusing. And obviously our code-base was changed without properly testing it, and I think also without properly understanding it. In Larsen's [paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf), on which the `SuperWord` implementation is based, they only ever explicitly test `independent(s1, s2)` for the elements of a pair. But in their definitions they definie not just `pairs` to be `independent`, but also `packs`. But how do you get from `independence` of `pairs` to `independence` of `packs`? The best I could find was this sentence in the paper: > > Since the adjacent memory identification phase uses alignment information, > it will never create pairs of memory accesses that cross an alignment boundary. > > It is not further described in the paper unfortunately. But the idea is that you have "alignment boundies", and that pairs are not supposed to cross them. I think that is exactly why we require all `mem_ref`'s of the same type (memory slice) to be aligned (`memory_alignment == 0`). That ensures that no pairs cross the alignment boundary of any other `pack` of the same type (memory slice). > > But of course requiring this strict alignment is quite restrictive. So that is why the CompileCommand `Vectorize` was introduced. But it was never properly tested it seems. And it just trusts the programmer that there are no cyclic dependencies. That is why I now added the verification and filtering. It prevents vectorization when cyclic dependencies are detected by my new `SuperWord::find_dependence`. > > > public class Test { > static int N = 100; > > public static void main(String[] strArr) { > float[] gold = new float[N]; > float[] data = new float[N]; > init(gold); > test(gold); > for (int i = 0; i < 10_000; i++){ > init(data); > test(data); > verify(data, gold); > } > System.out.println("success."); > } > > static void test(float[] data) { > for (int i = 0; i < N - 4; i++) { > data[i + 4] = data[i]; > } > } > > static void init(float[] data) { > for (int j = 0; j < N; j++) { > data[j] = j; > } > } > > static void verify(float[] data, float[] gold) { > for (int i = 0; i < N; i++) { > if (data[i] != gold[i]) { > throw new RuntimeException(" Invalid result: dataF[" + i + "]: " + data[i] + " != " + gold[i]); > } > } > } > } > do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two. Lets remove them to reduce noise ? I see a value in having non-power of 2 offsets, yes. 
They should vectorize if the vector width is small enough. And then there are some values like `18, 20, 192` that are a there to check vectorization with `+AlignVector`. Maybe you find the `MaxVectorSize <= 12` "noisy" somehow, because it is equivalent to `MaxVectorSize <= 8`? I find it rather helpful, because `12` reflects the `byte_offset`, and so makes the rule a bit more understandable. Finally, I generate many tests, I don't want to do that by hand. So maybe the rules are not simplified perfectly. I tried to improve it a bit. If you have a concrete idea how to further improve, I'm open for suggestions. I could for example round down the values to the next power of 2, or something like that. But again: would that really make the rules more understandable? ------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Mon Mar 6 10:45:53 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 6 Mar 2023 10:45:53 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs In-Reply-To: References: <86PauwnM1BpvfJVJW1kyP6g3sVeirHKrMIry-kl6Ins=.50f25d11-47d2-4b33-8008-7f36a874b06e@github.com> Message-ID: On Mon, 6 Mar 2023 04:48:04 GMT, Jatin Bhateja wrote: > I did some additional testing and found that following case which does not carry any true dependency no longer vectorizes with default options (-AlignVector). I think this was never supposed to vectorize, but just slipped through the cracks as a "happy accident". But the "bad accident" case also slipped through, as shown in my PR description at the beginning: static void test2(int[] dataI, float[] dataF) { for (int i = 0; i < RANGE - 2; i++) { // dataI has cyclid dependency of distance 2, cannot vectorize int v = dataI[i]; dataI[i + 2] = v; dataF[i] = v; // let's not get confused by another type } } The reason why it should not vectorize: loads / stores of the same `velt_type` (memory slice) do not pass `memory_alignment == 0`, which is crucial to ensure that the `packs` are `independent` (and not just the `pairs`). CompileCommand `Vectorize` bypasses this behaviour, as just explained here https://github.com/openjdk/jdk/pull/12350#discussion_r1126206603. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From redestad at openjdk.org Mon Mar 6 11:12:12 2023 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 6 Mar 2023 11:12:12 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms In-Reply-To: References: Message-ID: On Thu, 23 Feb 2023 20:28:31 GMT, Jasmine K. wrote: > Hello, > I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 
1.408 ns/op + 14.36% > LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% > LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% > LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% > LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% > LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% > LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% > > > In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. > > Testing: GHA, tier1 local, and performance testing > > Thanks, > Jasmine K Very nice overall! Some superficial comments inline. src/hotspot/share/opto/mulnode.cpp line 850: > 848: } > 849: > 850: // Check for "(x >> C1) << C2" which just masks off low bits The "which just masks off the low bits" comments should move to the C1 == C2 special case. Same below and for `LShiftLNode`. test/micro/org/openjdk/bench/vm/compiler/LShiftNodeIdealize.java line 100: > 98: public static class BenchState { > 99: int[] ints; > 100: Random random = new Random(); A hard-coded or parameterized seed is preferred for microbenchmarking to reduce noise from different data distributions in back-to-back runs. ------------- Changes requested by redestad (Reviewer). PR: https://git.openjdk.org/jdk/pull/12734 From duke at openjdk.org Mon Mar 6 12:13:05 2023 From: duke at openjdk.org (=?UTF-8?B?VG9tw6HFoQ==?= Zezula) Date: Mon, 6 Mar 2023 12:13:05 GMT Subject: RFR: JDK-8303646: Add possibility to lookup ResolvedJavaType from jclass. Message-ID: This pull request adds a `jdk.vm.ci.hotspot.HotSpotJVMCIRuntime#asResolvedJavaType(long hotspot_jclass_value)` method, which converts a HotSpot heap JNI `hotspot_jclass_value` to a `jdk.vm.ci.meta.ResolvedJavaType`. ------------- Commit messages: - JDK-8303646: Add possibility to lookup ResolvedJavaType from HotSpot heap JNI jclass. Changes: https://git.openjdk.org/jdk/pull/12878/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12878&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303646 Stats: 42 lines in 3 files changed: 42 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12878.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12878/head:pull/12878 PR: https://git.openjdk.org/jdk/pull/12878 From duke at openjdk.org Mon Mar 6 12:28:45 2023 From: duke at openjdk.org (=?UTF-8?B?VG9tw6HFoQ==?= Zezula) Date: Mon, 6 Mar 2023 12:28:45 GMT Subject: RFR: JDK-8303646: [JVMCI] Add possibility to lookup ResolvedJavaType from jclass. [v2] In-Reply-To: References: Message-ID: > This pull request adds a `jdk.vm.ci.hotspot.HotSpotJVMCIRuntime#asResolvedJavaType(long hotspot_jclass_value)` method, which converts a HotSpot heap JNI `hotspot_jclass_value` to a `jdk.vm.ci.meta.ResolvedJavaType`. Tom?? Zezula has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: JDK-8303646: Add possibility to lookup ResolvedJavaType from HotSpot heap JNI jclass. 
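For illustration, calling the new API from JVMCI client code might look roughly like the sketch below (only `asResolvedJavaType` itself comes from this pull request; the wrapper class and the way the raw jclass value is obtained are assumptions on my side):

import jdk.vm.ci.hotspot.HotSpotJVMCIRuntime;
import jdk.vm.ci.meta.ResolvedJavaType;

class ResolveFromJClass {
    // 'jclassHandle' is assumed to be a HotSpot heap JNI jclass value handed over
    // from native code as a long; producing that value is outside this change.
    static ResolvedJavaType resolve(long jclassHandle) {
        return HotSpotJVMCIRuntime.runtime().asResolvedJavaType(jclassHandle);
    }
}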
------------- Changes: - all: https://git.openjdk.org/jdk/pull/12878/files - new: https://git.openjdk.org/jdk/pull/12878/files/0e4b2f10..336144a0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12878&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12878&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12878.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12878/head:pull/12878 PR: https://git.openjdk.org/jdk/pull/12878 From rcastanedalo at openjdk.org Mon Mar 6 13:25:27 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 6 Mar 2023 13:25:27 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally In-Reply-To: References: Message-ID: On Wed, 22 Feb 2023 13:59:41 GMT, Tobias Holenstein wrote: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Custom--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Custom profile > Each tab has a `--Custom--` filter profile which is selected when opening a graph. Filters applied to the `--Custom--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Custom--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Custom--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab Thanks for working on this, Toby! Being able to apply different filters per tab is definitely useful feature. However, as an IGV user I miss two things from the current behavior: persistence (the same filters are applied after restarting IGV) and the ability to apply the same filter configuration to all tabs in a simple manner. I would like to propose an alternative model that is almost a superset of what is proposed here and would preserve persistence and easy filter synchronization among tabs. By default, each tab has two filter profiles available, "local" and "global". More profiles cannot be added or removed. The local filter profile can be edited but is not persistent (i.e. it acts like the `--Custom--` profile in this changeset). The global filter profile can be edited, is persistent, and the changes are propagated for all tabs where it is selected. The `Link node selection globally` button is generalized to `Link node and filter selection globally`. It is disabled by default, and clicking on it selects the global filter profile for all opened tabs. What do you think? This is just my input as user, it would be useful to see what others think here. 
------------- PR: https://git.openjdk.org/jdk/pull/12714 From roland at openjdk.org Mon Mar 6 14:26:19 2023 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 6 Mar 2023 14:26:19 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v5] In-Reply-To: References: Message-ID: > The loop that doesn't vectorize is: > > > public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { > for (int i = start; i < stop; i++) { > UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); > } > } > > > It's from a micro-benchmark in the panama > repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing > because it finds it cannot properly align the loop and, from the > comment in the code, that: > > > // Can't allow vectorization of unaligned memory accesses with the > // same type since it could be overlapped accesses to the same array. > > > The test for "same type" is implemented by looking at the memory > operation type which in this case is overly conservative as the loop > above is reading and writing with long loads/stores but from and to > arrays of different types that can't overlap. Actually, with such > mismatched accesses, it's also likely an incorrect test (reading and > writing could be to the same array with loads/stores that use > different operand size) eventhough I couldn't write a test case that > would trigger an incorrect execution. > > As a fix, I propose implementing the "same type" test by looking at > memory aliases instead. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: improved test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12440/files - new: https://git.openjdk.org/jdk/pull/12440/files/f6820c45..0e3a3c84 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12440&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12440&range=03-04 Stats: 123 lines in 1 file changed: 101 ins; 1 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/12440.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12440/head:pull/12440 PR: https://git.openjdk.org/jdk/pull/12440 From roland at openjdk.org Mon Mar 6 14:47:12 2023 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 6 Mar 2023 14:47:12 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v5] In-Reply-To: <4PpM775Vw8MQtZu4vuycM0ZDx8bkqy8TJ_aS4OmRDVE=.63e42077-5ccd-43a8-96af-163f0fe8392b@github.com> References: <4PpM775Vw8MQtZu4vuycM0ZDx8bkqy8TJ_aS4OmRDVE=.63e42077-5ccd-43a8-96af-163f0fe8392b@github.com> Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Fri, 3 Mar 2023 06:37:08 GMT, Tobias Hartmann wrote: >> C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. 
>> >> The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: >> >> ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) >> >> It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. >> >> With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: >> >> ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) >> >> With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). >> >> The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: >> https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 >> >> Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. >> >> The fix is to move `set_default_node_notes` down to after `do_exits`. >> >> I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Use MAX2 Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.org/jdk/pull/12806 From tholenstein at openjdk.org Mon Mar 6 14:59:00 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 6 Mar 2023 14:59:00 GMT Subject: RFR: JDK-8303443: IGV: Syntax highlighting and resizing for filter editor [v3] In-Reply-To: References: Message-ID: > In the Filter window of the IdealGraphVisualizer (IGV) the user can double-click on a filter to edit the javascript code. 
> > - Previously, the code window was not resizable and had no syntax highlighting > editor_old > > - Now, the code window can be resized by the user and has basic syntax highlighting for `keywords`, `strings` and `comments` > editor_new Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: make NetBeans form editor work again ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12803/files - new: https://git.openjdk.org/jdk/pull/12803/files/4b2dd4a5..4f02c553 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12803&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12803&range=01-02 Stats: 45 lines in 2 files changed: 20 ins; 5 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/12803.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12803/head:pull/12803 PR: https://git.openjdk.org/jdk/pull/12803 From rcastanedalo at openjdk.org Mon Mar 6 15:38:13 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 6 Mar 2023 15:38:13 GMT Subject: RFR: JDK-8303443: IGV: Syntax highlighting and resizing for filter editor [v3] In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 14:59:00 GMT, Tobias Holenstein wrote: >> In the Filter window of the IdealGraphVisualizer (IGV) the user can double-click on a filter to edit the javascript code. >> >> - Previously, the code window was not resizable and had no syntax highlighting >> editor_old >> >> - Now, the code window can be resized by the user and has basic syntax highlighting for `keywords`, `strings` and `comments` >> editor_new > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > make NetBeans form editor work again Thanks for addressing my comments, looks good! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/12803 From duke at openjdk.org Mon Mar 6 16:06:13 2023 From: duke at openjdk.org (=?UTF-8?B?VG9tw6HFoQ==?= Zezula) Date: Mon, 6 Mar 2023 16:06:13 GMT Subject: RFR: JDK-8303678: [JVMCI] Add possibility to convert object JavaConstant to jobject. Message-ID: <7CCfSjqdge_fL8Ev_oY44xARp28LpOIOwZQjTks8Igg=.61bcfaa2-23fe-4dbd-965b-39b77ebdec5e@github.com> This pull request adds a `jdk.vm.ci.hotspot.HotSpotJVMCIRuntime#getJObjectValue(HotSpotObjectConstant peerObject)` method, which gets a reference to an object in the peer runtime wrapped by the `jdk.vm.ci.hotspot.IndirectHotSpotObjectConstantImpl`. The reference is returned as a HotSpot heap JNI jobject. ------------- Commit messages: - JDK-8303678: Add possibility to convert object JavaConstant to jobject. 
Changes: https://git.openjdk.org/jdk/pull/12882/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12882&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303678 Stats: 20 lines in 1 file changed: 20 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12882.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12882/head:pull/12882 PR: https://git.openjdk.org/jdk/pull/12882 From kvn at openjdk.org Mon Mar 6 16:09:14 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 6 Mar 2023 16:09:14 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 21:41:35 GMT, Vladimir Kozlov wrote: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress GHA failure on linux-x86 in test compiler/vectorization/runner/LoopRangeStrideTest.java is due to [JDK-8303105](https://bugs.openjdk.org/browse/JDK-8303105) ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Mon Mar 6 16:09:13 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 6 Mar 2023 16:09:13 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter Message-ID: Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. Tested tier1-5, Xcomp, stress ------------- Commit messages: - Remove ConvF2HFNode::Identity(). Updated tests - Copyright year update - Check float16 instructions support on platforms for C1 - Update Copyright year. 
Add missing UnlockDiagnosticVMOptions flag in new test - Implement Aarch64 and RiscV parts, remove 32-bit x86 runtime stub - Merge branch 'master' into 8302976 - 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter Changes: https://git.openjdk.org/jdk/pull/12869/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12869&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8302976 Stats: 1432 lines in 48 files changed: 1284 ins; 97 del; 51 mod Patch: https://git.openjdk.org/jdk/pull/12869.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12869/head:pull/12869 PR: https://git.openjdk.org/jdk/pull/12869 From never at openjdk.org Mon Mar 6 16:10:32 2023 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 6 Mar 2023 16:10:32 GMT Subject: RFR: JDK-8303646: [JVMCI] Add possibility to lookup ResolvedJavaType from jclass. [v2] In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 12:28:45 GMT, Tom?? Zezula wrote: >> This pull request adds a `jdk.vm.ci.hotspot.HotSpotJVMCIRuntime#asResolvedJavaType(long hotspot_jclass_value)` method, which converts a HotSpot heap JNI `hotspot_jclass_value` to a `jdk.vm.ci.meta.ResolvedJavaType`. > > Tom?? Zezula has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > JDK-8303646: Add possibility to lookup ResolvedJavaType from HotSpot heap JNI jclass. Marked as reviewed by never (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/12878 From jsjolen at openjdk.org Mon Mar 6 16:14:00 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Mon, 6 Mar 2023 16:14:00 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v4] In-Reply-To: References: Message-ID: > Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we > need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. > > Here are some typical things to look out for: > > No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). > Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. > nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. > > An example of this: > > > // This function returns null > void* ret_null(); > // This function returns true if *x == nullptr > bool is_nullptr(void** x); > > > Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. > > Thanks! 
Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: Explicitly use 0 for null in ARM interpreter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12187/files - new: https://git.openjdk.org/jdk/pull/12187/files/a440058b..dd4f57e0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12187&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12187&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12187.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12187/head:pull/12187 PR: https://git.openjdk.org/jdk/pull/12187 From roland at openjdk.org Mon Mar 6 17:08:04 2023 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 6 Mar 2023 17:08:04 GMT Subject: RFR: 8300258: C2: vectorization fails on simple ByteBuffer loop [v2] In-Reply-To: References: <5i999_cMhXUrM7TY5N0YGJAsiegwj-SdzuKLGscKaTQ=.4c75e5f3-d9ff-485d-a525-462c02979671@github.com> <1jbga6zehjnJnIvTQEJNCASY18zE56emNpxGuBnQjH8=.b5fb2848-33ff-4f47-9ef4-b77b4982565b@github.com> Message-ID: <1tHvkIRt-pBT1ImtDsqcnIEAW6b_WaJ0uyW2F3EUEbg=.01eaa23a-29b6-43aa-87af-0a00d19ed594@github.com> On Thu, 2 Mar 2023 17:30:06 GMT, Vladimir Kozlov wrote: >>> Thanks for the comments @eme64 @vnkozlov Looking at the code again, if `vectors_should_be_aligned()` is true, if `create_pack` is false, the current code removes every memops and already created packset with `same_velt_type()` true: That can't be motivated by a correctness issue. So I suppose we want to preserve that behavior. Wouldn't we need the change of the last commit I pushed then? >> >> I think the reason we used `same_velt_type` was that we were confused. Or maybe we did that before we had memory slices, and using `same_velt_type` was at least already an improvemnt? At any rate: it was confused and leads to Bugs in conjunction with `Unsafe`, as my example showed. >> >> Keeping `same_velt_type` will probably not harm much, but be more restrictive than neccessary. >> It will not harm much because `velt_type == memory_slice` as long as we are not using `Unsafe`. >> And when we do use `Unsafe`, we probably do not use it in very wild ways. >> >> One "wild" use might be something like this: >> >> void test(int[] iarr, float[] farr) { >> // cyclic dependency -> not vectorized >> in v1 = (int)Unsafe.LoadF(iarr, i); // assume this to be best >> Unsafe.StoreI(iarr, i + 1); >> // separate slice -> could be vectorized >> Unsafe.StoreI(farr, i) = Unsafe.LoadI(farr, i); // on different slice as best, but have same velt_type -> rejected >> // We end up vectorizing nothing, even though we could vectorize the farr >> } > >> I think the reason we used `same_velt_type` was that we were confused. Or maybe we did that before we had memory slices, and using `same_velt_type` was at least already an improvemnt? At any rate: it was confused and leads to Bugs in conjunction with `Unsafe`, as my example showed. > > I did not consider using memory slices (or unsafe access) when worked on this code. Same element type was easy choice for this check. > >> I also don't think we should spend too much time making sure every possible combinations of unsafe accesses optimize well or even correctly if it's too much work. Once people start using unsafe, they are on their own. I think we should stick with whatever feels reasonable or is used in the core libraries (hopefully the second category is included in the first category). 
> > Yes, even without vectorization we can construct a Java test with Unsafe access which has cyclic dependencies and overwrite elements intentionally or by mistake. I remember one such case in system libraries several years ago which was fixed. > > JIT optimization should not introduce wrong behavior if Java code does not have it. But if we can correctly detect and reject cyclic dependency we can vectorize it. Original Roland's example and last Emanuel's `StoreI(farr, i)` example don't have "bad" cyclic dependency - at worst they store the same values to the same elements. So it is all about cyclic dependency detection and assumption that we may accessing the same array. Thanks for the reviews, testing and discussion @vnkozlov @eme64 I updated the test case once more: - Some of the ByteByte test were broken - I made sure the test only runs on aarch64 and x64, the platforms I can test this on. Also the test is now skipped if `AlignVector` is false - I added verification code for the content of the arrays ------------- PR: https://git.openjdk.org/jdk/pull/12440 From jbhateja at openjdk.org Mon Mar 6 17:50:12 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 6 Mar 2023 17:50:12 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: <9tb5DWpbU4UzvvQXlD4rRFtNEgwXbcVRNYmDtGan_sI=.21ae9f61-4abc-4143-9bf3-dfcd11b42ce9@github.com> References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> <9tb5DWpbU4UzvvQXlD4rRFtNEgwXbcVRNYmDtGan_sI=.21ae9f61-4abc-4143-9bf3-dfcd11b42ce9@github.com> Message-ID: On Mon, 6 Mar 2023 10:31:30 GMT, Emanuel Peter wrote: >>> With +AlignVector behavior with and without Vectorize,true pragma should match. >> >> This was about example with `fArr[i + 4] = fArr[i];` in the loop. `byte_offset = 4 * 4 = 16`. >> >> @jatin-bhateja I am not sure what you are trying to say, what do you mean by `should match`? >> >> If you mean to say "should vectorize": I think it should **not** vectorize, and your output shows that there must be a bug (with master, before my fix): >> `LoadVector === ... #vectory[8]:{float}` >> You have a cyclic dependency with float-distance 4 (`byte_distance = 16`). But you have 8 floats in the vector. That will lead to wrong results. It should only vectorize if `MaxVectorSize <= 16`. See conditions for `testIntP4` which I quoted above. >> >> I made a full test with it, and pasted it below. >> I run it with these command lines: >> >> `./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=Vectorize,Test::test,true -XX:+TraceNewVectors -XX:+AlignVector Test.java` >> >> 1. On `master`, with `+AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts! >> 2. On `master`, with `-AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts! >> 3. On `master`, with `-AlignVector` and `Vectorize false`: Does not vectorize. Detects the cyclic dependency (`LoadF` and `StoreF` have `memory_alignment != 0`). >> 4. On `master`, with `+AlignVector` and `Vectorize false`: same as for 4. >> >> As you can see, here the flag `AlignVector` is not even relevant. >> >> Why do we get wrong results? We bypass the `memory_alignment == 0` check when we have `_do_vector_loop == true`. That bypasses the alignment analysis which is critical. Without it, we only ever check `independence` at distance 1 (for the pairs), and not for all elements in a vector! 
Relevant section on `master`: >> https://github.com/openjdk/jdk/blob/8f195ff236000d9c019f8beb2b13355083e211b5/src/hotspot/share/opto/superword.cpp#L646 >> >> With `my patch`: all of the command-lines from above will not vectorize. Except if you set `-XX:MaxVectorSize=16` or smaller, where the cyclic dependency cannot manifest within one vector. >> >> @jatin-bhateja does this answer you question? Or did I misunderstand your question? >> >> **PS**: I have found the "alignment analysis" and "independence" checks rather confusing. And obviously our code-base was changed without properly testing it, and I think also without properly understanding it. In Larsen's [paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf), on which the `SuperWord` implementation is based, they only ever explicitly test `independent(s1, s2)` for the elements of a pair. But in their definitions they definie not just `pairs` to be `independent`, but also `packs`. But how do you get from `independence` of `pairs` to `independence` of `packs`? The best I could find was this sentence in the paper: >> >> Since the adjacent memory identification phase uses alignment information, >> it will never create pairs of memory accesses that cross an alignment boundary. >> >> It is not further described in the paper unfortunately. But the idea is that you have "alignment boundies", and that pairs are not supposed to cross them. I think that is exactly why we require all `mem_ref`'s of the same type (memory slice) to be aligned (`memory_alignment == 0`). That ensures that no pairs cross the alignment boundary of any other `pack` of the same type (memory slice). >> >> But of course requiring this strict alignment is quite restrictive. So that is why the CompileCommand `Vectorize` was introduced. But it was never properly tested it seems. And it just trusts the programmer that there are no cyclic dependencies. That is why I now added the verification and filtering. It prevents vectorization when cyclic dependencies are detected by my new `SuperWord::find_dependence`. >> >> >> public class Test { >> static int N = 100; >> >> public static void main(String[] strArr) { >> float[] gold = new float[N]; >> float[] data = new float[N]; >> init(gold); >> test(gold); >> for (int i = 0; i < 10_000; i++){ >> init(data); >> test(data); >> verify(data, gold); >> } >> System.out.println("success."); >> } >> >> static void test(float[] data) { >> for (int i = 0; i < N - 4; i++) { >> data[i + 4] = data[i]; >> } >> } >> >> static void init(float[] data) { >> for (int j = 0; j < N; j++) { >> data[j] = j; >> } >> } >> >> static void verify(float[] data, float[] gold) { >> for (int i = 0; i < N; i++) { >> if (data[i] != gold[i]) { >> throw new RuntimeException(" Invalid result: dataF[" + i + "]: " + data[i] + " != " + gold[i]); >> } >> } >> } >> } > >> do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two. Lets remove them to reduce noise ? > > I see a value in having non-power of 2 offsets, yes. They should vectorize if the vector width is small enough. And then there are some values like `18, 20, 192` that are a there to check vectorization with `+AlignVector`, where we expect vectorization only if we have `byte_offset % vector_width == 0`. So it is interesting to have some non-power-of-2 values that have various power-of-2 factors in them. > > Maybe you find the `MaxVectorSize <= 12` "noisy" somehow, because it is equivalent to `MaxVectorSize <= 8`? 
I find it rather helpful, because `12` reflects the `byte_offset`, and so makes the rule a bit more understandable. > > Finally, I generate many tests, I don't want to do that by hand. So maybe the rules are not simplified perfectly. I tried to improve it a bit. If you have a concrete idea how to further improve, I'm open for suggestions. I could for example round down the values to the next power of 2, or something like that. But again: would that really make the rules more understandable? > > With +AlignVector behavior with and without Vectorize,true pragma should match. > > This was about example with `fArr[i + 4] = fArr[i];` in the loop. `byte_offset = 4 * 4 = 16`. > > @jatin-bhateja I am not sure what you are trying to say, what do you mean by `should match`? > Yes, this was a bug in mainline where we were incorrectly vectorizing which is now fixed with your changes, just wanted to get that point highlighted. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From thartmann at openjdk.org Mon Mar 6 18:39:39 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 6 Mar 2023 18:39:39 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v5] In-Reply-To: <4PpM775Vw8MQtZu4vuycM0ZDx8bkqy8TJ_aS4OmRDVE=.63e42077-5ccd-43a8-96af-163f0fe8392b@github.com> References: <4PpM775Vw8MQtZu4vuycM0ZDx8bkqy8TJ_aS4OmRDVE=.63e42077-5ccd-43a8-96af-163f0fe8392b@github.com> Message-ID: <9f28lJesfbw-XWucLy5VAIJKZ4037NQ4BV6De_reTuk=.b073470c-c85b-4446-b04f-5e687ffb8689@github.com> On Fri, 3 Mar 2023 06:37:08 GMT, Tobias Hartmann wrote: >> C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. >> >> The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: >> >> ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) >> >> It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. >> >> With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: >> >> ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) >> >> With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). >> >> The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. 
We do the same in the matcher: >> https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 >> >> Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. >> >> The fix is to move `set_default_node_notes` down to after `do_exits`. >> >> I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Use MAX2 Thanks for the review, Vladimir and Roland! ------------- PR: https://git.openjdk.org/jdk/pull/12806 From kvn at openjdk.org Mon Mar 6 20:41:12 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 6 Mar 2023 20:41:12 GMT Subject: RFR: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 02:42:21 GMT, SUN Guoyun wrote: > This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 LoongArch64 architecture. > >

> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
> # assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
> #
> 
> This PR fix the issue, Please help review it. > > Thanks. src/hotspot/share/opto/block.hpp line 195: > 193: if (dom_diff > 0) return false; > 194: for (; dom_diff < 0; dom_diff++) that = that->_idom; > 195: return (this == that) || (this != that && this->_dom_depth == that->_dom_depth); I don't think this is correct. `this` should be reachable from `that` for this method return `true`. Imaging you compare `B53->dominates(B61)` from your example. ------------- PR: https://git.openjdk.org/jdk/pull/12874 From kvn at openjdk.org Mon Mar 6 23:57:16 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 6 Mar 2023 23:57:16 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 21:41:35 GMT, Vladimir Kozlov wrote: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress @fyang, please help to verify that new tests passed on RISC-V with these changes and review these changes. Thanks! I tested x86 (64- and 32-bit) and AArch64. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From sviswanathan at openjdk.org Tue Mar 7 00:22:15 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 7 Mar 2023 00:22:15 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 23:54:44 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > @fyang, please help to verify that new tests passed on RISC-V with these changes and review these changes. Thanks! 
> > I tested x86 (64- and 32-bit) and AArch64. @vnkozlov Thanks a lot for taking this up. Is the following in the PR description still true: "Replaced SharedRuntime::f2hf() and hf2f() C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic." >From the PR it looks to me that for x86_64 you have the changes in place for SharedRuntime and the same result is produced across SharedRuntime, interpreter, c1, and c2. For x86 32-bit also things are consistent across. Only the SharedRuntime optimization doesnt happen for x86 32bit as StubRoutines::hf2f() and StubRoutines::f2hf() are set as null. The fallback is handled correctly in interpreter, c1, and c2. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 00:51:12 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 00:51:12 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 00:19:03 GMT, Sandhya Viswanathan wrote: > For x86 32-bit also things are consistent across. Only the SharedRuntime optimization doesnt happen for x86 32bit as StubRoutines::hf2f() and StubRoutines::f2hf() are set as null. The fallback is handled correctly in interpreter, c1, and c2. Correct, it is consistent. Only optimization to calculate constant value during compile time is skipped. C2 will generate HW instruction for `ConvF2HF` node as if its input was not constant. That is it. It is possible to add similar Stub routines for AArch64 and RISC-V to be called from C2 but I am not expert in those platforms so I skipped them. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 00:55:16 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 00:55:16 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 21:41:35 GMT, Vladimir Kozlov wrote: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress Note, I removed `ConvF2HFNode::Identity()` optimization because tests show that it produces different NaN results due to skipped conversion. 
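A minimal, self-contained sketch of the kind of cross-engine consistency check discussed in this thread (illustrative only: the class name and checksum scheme are assumptions, not the jtreg tests from the patch). It exercises both conversions over every 16-bit input; running it once with -Xint and once with -Xcomp and comparing the printed values is a quick way to see whether interpreted and compiled conversions agree. The exhaustive loop is cheap because binary16 only has 65536 encodings.

public class Float16ConsistencyCheck {
    public static void main(String[] args) {
        long checksum = 0;
        int roundTripDiffs = 0;
        for (int bits = 0; bits <= 0xFFFF; bits++) {
            short s = (short) bits;
            float f = Float.float16ToFloat(s);      // widen float16 -> float
            short back = Float.floatToFloat16(f);   // narrow float -> float16
            checksum += (back & 0xFFFFL) + Float.floatToRawIntBits(f);
            if (back != s) {
                roundTripDiffs++;  // differences, if any, can only come from NaN encodings
            }
        }
        // Compare these numbers between, e.g., -Xint and -Xcomp runs.
        System.out.println("checksum=" + checksum + " roundTripDiffs=" + roundTripDiffs);
    }
}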
------------- PR: https://git.openjdk.org/jdk/pull/12869 From duke at openjdk.org Tue Mar 7 01:21:17 2023 From: duke at openjdk.org (SUN Guoyun) Date: Tue, 7 Mar 2023 01:21:17 GMT Subject: RFR: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 In-Reply-To: References: Message-ID: <20Ve037spuX25RDg8zcS7YOAEwWuMIlQiDNMDnVm60k=.0332fb26-49e4-4371-ac74-3e7ba9cfe2f3@github.com> On Mon, 6 Mar 2023 20:37:47 GMT, Vladimir Kozlov wrote: >> This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 LoongArch64 architecture. >> >>

>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
>> # assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
>> #
>> 
>> This PR fixes the issue. Please help review it. >> >> Thanks. > > src/hotspot/share/opto/block.hpp line 195: > >> 193: if (dom_diff > 0) return false; >> 194: for (; dom_diff < 0; dom_diff++) that = that->_idom; >> 195: return (this == that) || (this != that && this->_dom_depth == that->_dom_depth); > > I don't think this is correct. `this` should be reachable from `that` for this method to return `true`. > Imagine you compare `B53->dominates(B61)` from your example. Thanks @vnkozlov. The example I enumerated reflects the relationship between the related blocks when `TestUnreachableInnerLoop.java` fails. B53(this) cannot dominate B55(that), so the assertion fails:

#  Internal Error (/home/user/jdk-ls/src/hotspot/share/opto/block.cpp:1375), pid=17010, tid=17024
#  assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
Indeed, as can be seen from this example, B53 and B61 only dominates itself, B52 dominates B54 and B55, so I'm not sure if this patch is the most correct solution, Do you have any good advice for me? ------------- PR: https://git.openjdk.org/jdk/pull/12874 From sviswanathan at openjdk.org Tue Mar 7 01:24:19 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 7 Mar 2023 01:24:19 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 21:41:35 GMT, Vladimir Kozlov wrote: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress src/hotspot/cpu/x86/macroAssembler_x86.hpp line 199: > 197: void flt_to_flt16(Register dst, XMMRegister src, XMMRegister tmp) { > 198: // Instruction requires different XMM registers > 199: vcvtps2ph(tmp, src, 0x04, Assembler::AVX_128bit); vcvtps2ph can have source and destination as same. Did you mean to say here in the comment that "Instruction requires XMM register as destination"? src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 3928: > 3926: } > 3927: > 3928: if (VM_Version::supports_f16c() || VM_Version::supports_avx512vl()) { We could check for VM_Version::supports_float16() here instead. src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 3931: > 3929: // For results consistency both intrinsics should be enabled. > 3930: if (vmIntrinsics::is_intrinsic_available(vmIntrinsics::_float16ToFloat) && > 3931: vmIntrinsics::is_intrinsic_available(vmIntrinsics::_floatToFloat16)) { Should this also check for InlineIntrinsics? src/hotspot/cpu/x86/templateInterpreterGenerator_x86_64.cpp line 346: > 344: } > 345: // For AVX CPUs only. f16c support is disabled if UseAVX == 0. > 346: if (VM_Version::supports_f16c() || VM_Version::supports_avx512vl()) { We could check for VM_Version::supports_float16() here instead. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From sviswanathan at openjdk.org Tue Mar 7 01:29:13 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 7 Mar 2023 01:29:13 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 00:52:37 GMT, Vladimir Kozlov wrote: > Note, I removed `ConvF2HFNode::Identity()` optimization because tests show that it produces different NaN results due to skipped conversion. Yes, removing the Identity optimization is correct. It doesn't hold for NaN inputs. 
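To make the NaN case concrete, a small hedged sketch (the class name and bit patterns are illustrative assumptions, not code from the patch) of why folding `ConvF2HF(ConvHF2F(x)) => x` is unsafe: widening a signaling half-precision NaN to `float` may set the quiet bit, so narrowing back need not reproduce the original bits.

public class Float16NaNRoundTrip {
    public static void main(String[] args) {
        // 0x7C01 and 0xFC01 have the binary16 quiet bit (0x0200) clear,
        // i.e. they are signaling NaN encodings; 0x7E00 is the canonical quiet NaN.
        short[] samples = { (short) 0x7C01, (short) 0xFC01, (short) 0x7E00 };
        for (short s : samples) {
            float f = Float.float16ToFloat(s);   // float16 -> float
            short r = Float.floatToFloat16(f);   // float -> float16
            // The round trip is not guaranteed to be bit-identical for NaNs,
            // which is why the Identity transform was removed.
            System.out.printf("in=0x%04x out=0x%04x same=%b%n",
                              s & 0xFFFF, r & 0xFFFF, s == r);
        }
    }
}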
------------- PR: https://git.openjdk.org/jdk/pull/12869 From sviswanathan at openjdk.org Tue Mar 7 01:29:16 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 7 Mar 2023 01:29:16 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 21:41:35 GMT, Vladimir Kozlov wrote: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress Other than the minor comments above, the x86 side changes look good to me. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From pli at openjdk.org Tue Mar 7 01:44:13 2023 From: pli at openjdk.org (Pengfei Li) Date: Tue, 7 Mar 2023 01:44:13 GMT Subject: RFR: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 02:42:21 GMT, SUN Guoyun wrote: > This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 LoongArch64 architecture. > >

> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
> # assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
> #
> 
> This PR fix the issue, Please help review it. > > Thanks. [_Not a review_] I think this issue is duplicated to JDK-8291025. @eme64 has some detailed analysis about this. Please see his comments at https://bugs.openjdk.org/browse/JDK-8291025 ------------- PR: https://git.openjdk.org/jdk/pull/12874 From Divino.Cesar at microsoft.com Tue Mar 7 02:01:21 2023 From: Divino.Cesar at microsoft.com (Cesar Soares Lucas) Date: Tue, 7 Mar 2023 02:01:21 +0000 Subject: Plan to create unique types for inputs of allocation merges In-Reply-To: <5796e30a-e9ed-429a-b7d5-851d12a79500@oracle.com> References: <5796e30a-e9ed-429a-b7d5-851d12a79500@oracle.com> Message-ID: Hi, Vladimir. Once more, thank you for taking a look at my proposal and reviewing my PRs. > I'm glad you continue working on the EA enhancement. Looking forward for > the next iteration of your work! I just created a new PR (https://github.com/openjdk/jdk/pull/12897) with the latest progress I have on this front! This latest PR adds support for re-materialization of scalar replaced objects participating in merges. I.e., the `new Point` below will now be scalar replaced. Note that only one of the inputs to the merge is scalar replaced. Point p = new Point(...); if (...) p = otherMethod(...); new Unloaded(); The current PR only handle merges that are used solely as debug information. The next PR I'm going to create is for scalar replacing merges that have field loads as users and I'm still considering using LoadNode::split_through_phi for that. However, I'm still investigating if I'll be able to use split_through_phi since it requires `instance_id`. > I still think that improving split_unique_types to handle allocation > merges is a better alternative to custom RAM nodes with an extra phase > to optimize them. I agree with you here. I don't think I'll need new IR node for untangling the merges. > I wouldn't be bothered too much by the fact that stores can't be > reliably optimized in general case. The whole idea behind JDK-8289943 is > to focus on "simple enough" shapes. There will always be complex enough > code shapes the conservative analysis couldn't handle. Sounds good to me! Cheers, Cesar ________________________________________ From: Vladimir Ivanov Sent: Monday, February 27, 2023 2:01 PM To: Cesar Soares Lucas; hotspot-compiler-dev at openjdk.java.net Subject: Re: Plan to create unique types for inputs of allocation merges Hi Cesar, I assume you consider enhancing split_unique_types as part of JDK-8289943 [1] we discussed before. I still think that improving split_unique_types to handle allocation merges is a better alternative to custom RAM nodes with an extra phase to optimize them. Your proposal looks reasonable. Speaking of Store nodes, my recollection is they aren't handled by pr/9073 [1] yet. (In particular, ConnectionGraph::can_reduce_this_phi() bails on !ConnectionGraph::is_read_only() check [2].) I wouldn't be bothered too much by the fact that stores can't be reliably optimized in general case. The whole idea behind JDK-8289943 is to focus on "simple enough" shapes. There will always be complex enough code shapes the conservative analysis couldn't handle. I'm glad you continue working on the EA enhancement. Looking forward for the next iteration of your work! 
Best regards, Vladimir Ivanov [1] https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.openjdk.org%2Fjdk%2Fpull%2F9073&data=05%7C01%7CDivino.Cesar%40microsoft.com%7Cc583064e305e4368c6a408db190e3c98%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638131321163410389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=uO1mpEkUQiupJ5TtMUGitwwcq%2BzQe1FKg6lKqJOsc5U%3D&reserved=0 [2] https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fopenjdk%2Fjdk%2Fpull%2F9073%2Ffiles%23diff-03f7ae3cf79ff61be6e4f0590b7809a87825b073341fdbfcf36143b99c304474R652&data=05%7C01%7CDivino.Cesar%40microsoft.com%7Cc583064e305e4368c6a408db190e3c98%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638131321163410389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=JOXdGfitlt84KXZNYYLC%2BEyZrZNR1N8N0SAMoh53vlA%3D&reserved=0 On 2/6/23 11:56, Cesar Soares Lucas wrote: > Hello there! > > Can I please get some feedback on the following plan? > > --- > > Plan to patch the split_unique_types method to make it able to create > > unique types for inputs participating in NoEscape merges. The motivation for > > doing this is because it will make possible to remove the allocation merge > > (after making other changes) and scalar replace some of the merge's input. I > > will be concurrently working on a different set of changes to eliminate the > > allocation merges. > > My idea is to approach this in three parts. The parts are separated by > the type > > of node that uses the allocation merge. > > 1) Merges that have only SafePoint (and subclasses) nodes as users. This > will > > let me handle merges used as debug information in SafePoints. There are > not many > > merges that are only used by SafePoints but SafePointNode is the most common > > user of merges (the merge is used by a SafePointNode and some other type > > of node). > > 2) Also handle merges that have field loads as users. This will let me > later on > > split-thru-phi the loads and after some other changes remove the merge. > > With (1) and (2) in place many (maybe most) merge usage patterns can be > > simplified. > > 3) Also handle merges that have Store as users. This is by far the most > > complicated case. It might not be possible to handle this kind of merge in > > general, but we can make some cases work. My current idea for handling this > > scenario is to clone the Store, make each one use a different input from the > > merge and have a different unique type. This will require adding a "selector > > phi" to output which input of the merge should be used and IfNode's to > control > > the execution of the cloned Store's. I understand we'll need to limit > the number > > of bases and stores that we can handle. 
> > As an illustration of the 3rd scenario above, this code: > > p = phi(x, o1, o2); > > p.x = 10; // no instance_id > > would become: > > p1 = phi(x, o1, NULL); > > p2 = phi(x, NULL, o2); > > sl = phi(x, 0, 1); // selector Phi > > if (sl == 0) > > p1.x = 10; // instance_id "1" > > else > > p2.x = 10; // instance_id "2" > > Best regards, > > Cesar > From kvn at openjdk.org Tue Mar 7 02:07:21 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 02:07:21 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 21:41:35 GMT, Vladimir Kozlov wrote: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress Thank you for review @sviswa7. I will address you comments. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 02:07:26 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 02:07:26 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 01:04:00 GMT, Sandhya Viswanathan wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > src/hotspot/cpu/x86/macroAssembler_x86.hpp line 199: > >> 197: void flt_to_flt16(Register dst, XMMRegister src, XMMRegister tmp) { >> 198: // Instruction requires different XMM registers >> 199: vcvtps2ph(tmp, src, 0x04, Assembler::AVX_128bit); > > vcvtps2ph can have source and destination as same. Did you mean to say here in the comment that "Instruction requires XMM register as destination"? 
`flt_to_flt16` is used in `x86.ad` instruction which requires preserving `src` register. I did not want to add an other macroassembler instruction for src->src case. I will add this to this comment. > src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 3928: > >> 3926: } >> 3927: >> 3928: if (VM_Version::supports_f16c() || VM_Version::supports_avx512vl()) { > > We could check for VM_Version::supports_float16() here instead. Yes. > src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 3931: > >> 3929: // For results consistency both intrinsics should be enabled. >> 3930: if (vmIntrinsics::is_intrinsic_available(vmIntrinsics::_float16ToFloat) && >> 3931: vmIntrinsics::is_intrinsic_available(vmIntrinsics::_floatToFloat16)) { > > Should this also check for InlineIntrinsics? `vmIntrinsics::disabled_by_jvm_flags()` checks `InlineIntrinsics`. See `vmIntrinsics.cpp` changes. > src/hotspot/cpu/x86/templateInterpreterGenerator_x86_64.cpp line 346: > >> 344: } >> 345: // For AVX CPUs only. f16c support is disabled if UseAVX == 0. >> 346: if (VM_Version::supports_f16c() || VM_Version::supports_avx512vl()) { > > We could check for VM_Version::supports_float16() here instead. Yes. And I need to remove `!InlineIntrinsics` check at line 340. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From cslucas at openjdk.org Tue Mar 7 02:14:51 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Tue, 7 Mar 2023 02:14:51 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v14] In-Reply-To: References: Message-ID: <_U3c5AkkByfX0HDPKFU8nU_50S64mWoEvzXVoYsTols=.eaaad9d0-7a2b-48b2-b760-3d9b1435cba8@github.com> On Tue, 3 Jan 2023 20:27:41 GMT, Cesar Soares Lucas wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Windows/Linux/MAC fastdebug/release >> - hotspot_all >> - tier1 >> - Renaissance >> - dacapo >> - new IR-based tests > > Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 43 commits: > > - updating with master branch > - Fix x86 tests. > - Fix code style. > - Address PR feedback. Fix test & one bug. Set RAM parameter to true by default. > - Addressing PR feedback. Added new constraint for case of merging SR and NSR allocations. 
> - Merge branch 'openjdk:master' into allocation-merges > - Remove debug messages. > - Add functional tests, micro benchmarks and fix some bugs. > - fix 32 bit execution. > - Back out on fixing existing issue. Some tests depend on it. > - ... and 33 more: https://git.openjdk.org/jdk/compare/ea25a561...d26909fa Hi, ALL. I decided to [create a new PR](https://github.com/openjdk/jdk/pull/12897) since after the latest changes the code looked much different than the version in this PR. I also attacked the problem from another direction: I decided to create an infrastructure for re-materializing objects before anything else since merges being used as debug information is the most common use case (see charts on new PR). Still, in this new approach, I plan to include all the feedback I received here. 1) No need for RAM node; 2) Improve split-unique-types, 3) Make use of split-through-phi. Thank you all again! ------------- PR: https://git.openjdk.org/jdk/pull/9073 From cslucas at openjdk.org Tue Mar 7 02:14:54 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Tue, 7 Mar 2023 02:14:54 GMT Subject: Withdrawn: 8289943: Simplify some object allocation merges In-Reply-To: References: Message-ID: <1eddrfe0L6Qvob84YWlk6NtR5LIlSXazTlBc8hHkHMg=.441abb42-e2fc-4cd2-9898-8c94ed55642a@github.com> On Tue, 7 Jun 2022 23:24:02 GMT, Cesar Soares Lucas wrote: > Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? > > The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: > 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). > 2) Scalar Replace the incoming allocations to the RAM node. > 3) Scalar Replace the RAM node itself. > > There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: > > - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. > > These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: > > - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. > - The way I check if there is an incoming Allocate node to the original Phi node. > - The way I check if there is no store to the merged objects after they are merged. > > Testing: > - Windows/Linux/MAC fastdebug/release > - hotspot_all > - tier1 > - Renaissance > - dacapo > - new IR-based tests This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From sviswanathan at openjdk.org Tue Mar 7 02:45:15 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 7 Mar 2023 02:45:15 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 01:59:25 GMT, Vladimir Kozlov wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 3931: >> >>> 3929: // For results consistency both intrinsics should be enabled. 
>>> 3930: if (vmIntrinsics::is_intrinsic_available(vmIntrinsics::_float16ToFloat) && >>> 3931: vmIntrinsics::is_intrinsic_available(vmIntrinsics::_floatToFloat16)) { >> >> Should this also check for InlineIntrinsics? > > `vmIntrinsics::disabled_by_jvm_flags()` checks `InlineIntrinsics`. See `vmIntrinsics.cpp` changes. Yes you are right. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 02:53:48 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 02:53:48 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v2] In-Reply-To: References: Message-ID: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12869/files - new: https://git.openjdk.org/jdk/pull/12869/files/2eb47bf5..9302d4bc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12869&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12869&range=00-01 Stats: 86 lines in 8 files changed: 15 ins; 24 del; 47 mod Patch: https://git.openjdk.org/jdk/pull/12869.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12869/head:pull/12869 PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 03:03:07 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 03:03:07 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 01:26:44 GMT, Sandhya Viswanathan wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. 
>> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Other than the minor comments above, the x86 side changes look good to me. @sviswa7 I updated the changes based on your comments. Please look: [9302d4b](https://github.com/openjdk/jdk/pull/12869/commits/9302d4bc00f8f1d8e774a260eb6aacb2d51a2dd4) ------------- PR: https://git.openjdk.org/jdk/pull/12869 From duke at openjdk.org Tue Mar 7 03:05:26 2023 From: duke at openjdk.org (SUN Guoyun) Date: Tue, 7 Mar 2023 03:05:26 GMT Subject: Withdrawn: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 02:42:21 GMT, SUN Guoyun wrote: > This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 LoongArch64 architecture. > >

> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
> # assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
> #
> 
> This PR fixes the issue. Please help review it. > > Thanks. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/12874 From duke at openjdk.org Tue Mar 7 03:05:25 2023 From: duke at openjdk.org (SUN Guoyun) Date: Tue, 7 Mar 2023 03:05:25 GMT Subject: RFR: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 02:42:21 GMT, SUN Guoyun wrote: > This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 LoongArch64 architecture. > >

> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
> # assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
> #
> 
> This PR fix the issue, Please help review it. > > Thanks. I've already associated this task with [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), so I'm closing this task now. ------------- PR: https://git.openjdk.org/jdk/pull/12874 From duke at openjdk.org Tue Mar 7 03:47:54 2023 From: duke at openjdk.org (Jasmine K.) Date: Tue, 7 Mar 2023 03:47:54 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v2] In-Reply-To: References: Message-ID: > Hello, > I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% > LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% > LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% > LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% > LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% > LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% > LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% > > > In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. > > Testing: GHA, tier1 local, and performance testing > > Thanks, > Jasmine K Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: Comments from code review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12734/files - new: https://git.openjdk.org/jdk/pull/12734/files/a1baa1d2..bd161561 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12734&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12734&range=00-01 Stats: 10 lines in 2 files changed: 1 ins; 1 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/12734.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12734/head:pull/12734 PR: https://git.openjdk.org/jdk/pull/12734 From duke at openjdk.org Tue Mar 7 03:47:55 2023 From: duke at openjdk.org (Jasmine K.) Date: Tue, 7 Mar 2023 03:47:55 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms In-Reply-To: References: Message-ID: On Thu, 23 Feb 2023 20:28:31 GMT, Jasmine K. wrote: > Hello, > I would like to generalize two ideal transforms for bitwise shifts. 
Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% > LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% > LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% > LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% > LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% > LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% > LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% > > > In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. > > Testing: GHA, tier1 local, and performance testing > > Thanks, > Jasmine K Thanks for the comments! I have updated the code. ------------- PR: https://git.openjdk.org/jdk/pull/12734 From kvn at openjdk.org Tue Mar 7 03:56:59 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 03:56:59 GMT Subject: RFR: 8302508: Add timestamp to the output TraceCompilerThreads Message-ID: Having timestamps added to the output of TraceCompilerThreads will be helpful in understanding how frequently the compiler threads are being added or removed. I did that and also added UL output. java -XX:+TraceCompilerThreads -XX:+PrintCompilation -version 86 Added initial compiler thread C2 CompilerThread0 86 Added initial compiler thread C1 CompilerThread0 92 1 3 java.lang.Object:: (1 bytes) 96 2 3 java.lang.String::coder (15 bytes) java -Xlog:jit+thread=debug -Xlog:jit+compilation=debug -version [0.078s][debug][jit,thread] Added initial compiler thread C2 CompilerThread0 [0.078s][debug][jit,thread] Added initial compiler thread C1 CompilerThread0 [0.083s][debug][jit,compilation] 1 3 java.lang.Object:: (1 bytes) [0.087s][debug][jit,compilation] 2 3 java.lang.String::coder (15 bytes) Tested tier1. 
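(Returning to the LShift ideal-transform generalization quoted earlier in this digest: the soundness of folding a right shift followed by a larger left shift into a single shift plus a mask can be checked with a small stand-alone program. The sketch below is illustrative only; the constants 4 and 8 and the class name are arbitrary and are not taken from the patch.)

```java
public class ShiftIdentityCheck {
    // For 0 <= C1 <= C2 < 32, ((x >> C1) << C2) == ((x << (C2 - C1)) & (-1 << C2)).
    static int original(int x)  { return (x >> 4) << 8; }          // shift-right then shift-left
    static int rewritten(int x) { return (x << 4) & (-1 << 8); }   // single shift plus mask

    public static void main(String[] args) {
        java.util.Random r = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = r.nextInt();
            if (original(x) != rewritten(x)) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("identity holds for all sampled values");
    }
}
```

The masked form matters because, as noted in that review thread, a bitwise and typically has higher throughput than a shift and can enable further strength reductions.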
------------- Commit messages: - 8302508: Add timestamp to the output TraceCompilerThreads Changes: https://git.openjdk.org/jdk/pull/12898/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12898&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8302508 Stats: 44 lines in 1 file changed: 28 ins; 0 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/12898.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12898/head:pull/12898 PR: https://git.openjdk.org/jdk/pull/12898 From jbhateja at openjdk.org Tue Mar 7 06:37:24 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 7 Mar 2023 06:37:24 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v9] In-Reply-To: <9bFwAtzXSc26dTNribicwtgQTa5GENR0Pwl0xUA1v2A=.2aa36d32-5976-4f75-a0d1-44227fc29168@github.com> References: <8lyMxPlTmzLDuwFLvhla8AAL6sVxEjigf4LaDlsAVWg=.4bfb5428-4b11-4c65-8540-503cf9e595f1@github.com> <_eDa4Xjs3HIxm2u4bUOWmmZL6y2DjQGd9kVyrUCTSD0=.d56d4533-bf46-4900-8c7c-6e22d247ff22@github.com> <9bFwAtzXSc26dTNribicwtgQTa5GENR0Pwl0xUA1v2A=.2aa36d32-5976-4f75-a0d1-44227fc29168@github.com> Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Thu, 2 Mar 2023 15:56:23 GMT, Emanuel Peter wrote: >> I realized I have lots of negative IR rules that check that we do NOT vectorize if I expect cyclic dependency. But these negative rules are difficult, there may always be some other factor that leads to shorter vector sizes than what I expect. And then it vectorizes, and does not encounter a cyclic dependency. So I will have to remove all these negative IR rules. >> >> @jatin-bhateja was there any positive IR rule that failed? One that did expect vectorization, but it did not in fact vectorize? > > I now removed all such negative IR rules. Hi @eme64 , Few more suggestions, my original request was to write few hand crafted tests with combinations of AlignVector, Vectorize and MaxVectorSize options. All the test points in TestDependencyOffsets.java are mainly around two kernels, forward writes (RAW/true dependency) and forward reads (WAR/anti-dependence) but exhaustively generates multiple IR rules. applyIfCPUFeature uses VM identified CPU feature checks which are constrained by UseSSE and UseAVX options. You may extend your script to generate curated scenarios like following to trigger more rules. // CPU: sse4.1 to avx -> vector_width: 16 -> elements in vector: 4 scenarios[0] = new Scenario(0, "-XX:-/+AlignVector", "-XX:MaxVectorSize=XXX", "-XX:UseAVX=0"); // CPU: avx2 -> vector_width: 32 -> elements in vector: 8 scenarios[1] = new Scenario(1, "-XX:-/+AlignVector", "-XX:MaxVectorSize=XXX", "-XX:UseAVX=2"); // CPU: avx512 -> vector_width: 64 -> elements in vector: 16 scenarios[2] = new Scenario(2, "-XX:-/+AlignVector", "-XX:MaxVectorSize=XXX", "-XX:UseAVX=3"); Since there are mainly 4 categories of features. 
CPROMPT>grep "applyIfCPUFeatureAnd" TestDependencyOffsets.java | sort -u | uniq applyIfCPUFeatureAnd = {"avx2", "true", "avx512bw", "false"}) applyIfCPUFeatureAnd = {"avx2", "true", "avx512", "false"}) applyIfCPUFeatureAnd = {"avx", "true", "avx512", "false"}) applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "false"}) applyIfCPUFeatureAnd = {"sse4.1", "true", "avx", "false"}) Also, this test file is script generated hence we may generate separate test files for AARCH64 and X86 to make the tests more maintainable. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From duke at openjdk.org Tue Mar 7 06:44:47 2023 From: duke at openjdk.org (changpeng1997) Date: Tue, 7 Mar 2023 06:44:47 GMT Subject: RFR: 8302906: AArch64: Add SVE backend support for vector unsigned comparison [v2] In-Reply-To: References: Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- > This patch implements unsigned vector comparison on SVE. > > 1: Test: > All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. > > 2: Performance: > (1): Benchmark: > As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. for ByteVector unsigned comparison: > > > @Benchmark > public void byteVectorUnsignedCompare() { > for (int j = 0; j < 200; j++) { > for (int i = 0; i < bspecies.length(); i++) { > ByteVector av = ByteVector.fromArray(bspecies, ba, i); > ByteVector ca = ByteVector.fromArray(bspecies, bb, i); > av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); > } > } > } > > > (2): Performance data > > Before: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 > ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 > IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 > LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 > > > After: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 > ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 > IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 > LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 > > > [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector > [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi > [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 > [4] https://bugs.openjdk.org/browse/JDK-8282850 > [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae changpeng1997 has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'openjdk:master' into sve_cmpU - 8302906: AArch64: Add SVE backend support for vector unsigned comparison This patch implements unsigned vector comparison on SVE. 1: Test: All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. 
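(For readers unfamiliar with the operator, the difference between a signed and an unsigned lane comparison can be seen with a small stand-alone snippet. It is purely illustrative, is not part of this patch or its tests, the class name is made up, and it needs `--add-modules jdk.incubator.vector` to run.)

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

public class UnsignedCompareDemo {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    public static void main(String[] args) {
        byte[] a = new byte[SPECIES.length()];
        byte[] b = new byte[SPECIES.length()];
        Arrays.fill(a, (byte) 0x80);  // -128 signed, 128 unsigned
        Arrays.fill(b, (byte) 0x01);

        ByteVector av = ByteVector.fromArray(SPECIES, a, 0);
        ByteVector bv = ByteVector.fromArray(SPECIES, b, 0);

        VectorMask<Byte> signedGt   = av.compare(VectorOperators.GT, bv);
        VectorMask<Byte> unsignedGt = av.compare(VectorOperators.UNSIGNED_GT, bv);
        System.out.println("GT lanes set:          " + signedGt.trueCount());   // expected 0
        System.out.println("UNSIGNED_GT lanes set: " + unsignedGt.trueCount()); // expected SPECIES.length()
    }
}
```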
2: Performance: (1): Benchmark: As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. for ByteVector unsigned comparison: ``` @Benchmark public void byteVectorUnsignedCompare() { for (int j = 0; j < 200; j++) { for (int i = 0; i < bspecies.length(); i++) { ByteVector av = ByteVector.fromArray(bspecies, ba, i); ByteVector ca = ByteVector.fromArray(bspecies, bb, i); av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); } } } ``` (2): Performance data Before: ``` Benchmark Score(op/ms) Error ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 ``` After: ``` Benchmark Score(op/ms) Error ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 ``` [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 [4] https://bugs.openjdk.org/browse/JDK-8282850 [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae TEST_CMD: true Jira: ENTLLT-6097 Change-Id: I236cf4a7626af3aad04bf081b47849a00e77df15 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12725/files - new: https://git.openjdk.org/jdk/pull/12725/files/599cf967..d210ae92 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12725&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12725&range=00-01 Stats: 27315 lines in 984 files changed: 17946 ins; 5228 del; 4141 mod Patch: https://git.openjdk.org/jdk/pull/12725.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12725/head:pull/12725 PR: https://git.openjdk.org/jdk/pull/12725 From duke at openjdk.org Tue Mar 7 07:02:27 2023 From: duke at openjdk.org (changpeng1997) Date: Tue, 7 Mar 2023 07:02:27 GMT Subject: RFR: 8302906: AArch64: Add SVE backend support for vector unsigned comparison [v3] In-Reply-To: References: Message-ID: > This patch implements unsigned vector comparison on SVE. > > 1: Test: > All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. > > 2: Performance: > (1): Benchmark: > As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. 
for ByteVector unsigned comparison: > > > @Benchmark > public void byteVectorUnsignedCompare() { > for (int j = 0; j < 200; j++) { > for (int i = 0; i < bspecies.length(); i++) { > ByteVector av = ByteVector.fromArray(bspecies, ba, i); > ByteVector ca = ByteVector.fromArray(bspecies, bb, i); > av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); > } > } > } > > > (2): Performance data > > Before: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 > ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 > IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 > LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 > > > After: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 > ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 > IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 > LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 > > > [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector > [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi > [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 > [4] https://bugs.openjdk.org/browse/JDK-8282850 > [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae changpeng1997 has updated the pull request incrementally with one additional commit since the last revision: Refactor part of code in C2 assembler and remove some switch-case stmts. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12725/files - new: https://git.openjdk.org/jdk/pull/12725/files/d210ae92..5acf5ba4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12725&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12725&range=01-02 Stats: 263 lines in 8 files changed: 92 ins; 63 del; 108 mod Patch: https://git.openjdk.org/jdk/pull/12725.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12725/head:pull/12725 PR: https://git.openjdk.org/jdk/pull/12725 From thartmann at openjdk.org Tue Mar 7 07:03:29 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 7 Mar 2023 07:03:29 GMT Subject: Integrated: 8201516: DebugNonSafepoints generates incorrect information In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 14:12:34 GMT, Tobias Hartmann wrote: > C2 emits incorrect debug information when diagnostic `-XX:+DebugNonSafepoints` is enabled. The problem is that renumbering of live nodes (`-XX:+RenumberLiveNodes`) introduced by [JDK-8129847](https://bugs.openjdk.org/browse/JDK-8129847) in JDK 8u92 / JDK 9 does not update the `_node_note_array` side table that links IR node indices to debug information. As a result, after node indices are updated, they point to unrelated debug information. 
> > The [reproducer](https://github.com/jodzga/debugnonsafepoints-problem) shared by the original reporter @jodzga (@jbachorik also reported this issue separately) does not work anymore with recent JDK versions but with a slight adjustment to trigger node renumbering, I could reproduce the wrong JFR method profile: > > ![Screenshot from 2023-03-01 13-17-48](https://user-images.githubusercontent.com/5312595/222146314-8b5299a8-c1c0-4360-b356-ac6a8c34371c.png) > > It suggests that the hottest method of the [Test](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L28) is **not** the long running loop in [Test::arraycopy](https://github.com/jodzga/debugnonsafepoints-problem/blob/f8ed40f24ef6a6bff7f86ea861c022db193ef48a/src/main/java/org/tests/Test.java#L56) but several other short running methods. The hot method is not even in the profile. This is obviously wrong. > > With the fix, or when running with `-XX:-RenumberLiveNodes` as a workaround, the correct profile looks like this: > > ![Screenshot from 2023-03-01 13-20-09](https://user-images.githubusercontent.com/5312595/222146316-b036ca7d-8a92-42b7-9570-c29e3cfcc2f2.png) > > With the help of the IR framework, it's easy to create a simple regression test (see `testRenumberLiveNodes`). > > The fix is to create a new `node_note_array` and copy the debug information to the right index after updating node indices. We do the same in the matcher: > https://github.com/openjdk/jdk/blob/c1e77e05647ca93bb4f39a320a5c7a632e283410/src/hotspot/share/opto/matcher.cpp#L337-L342 > > Another problem is that `Parse::Parse` calls `C->set_default_node_notes(caller_nn)` before `do_exits`, which resets the `JVMState` to the caller state. We then set the bci to `InvocationEntryBci` in the **caller** `JVMState`. Any new node that is emitted in `do_exits`, for example a `MemBarRelease`, will have that `JVMState` attached and `NonSafepointEmitter::observe_instruction` -> `DebugInformationRecorder::describe_scope` will then use that information when emitting debug info. The resulting debug info is misleading because it suggests that we are at the beginning of the caller method. The tests `testFinalFieldInit` and `testSynchronized` reproduce that scenario. > > The fix is to move `set_default_node_notes` down to after `do_exits`. > > I find it also misleading that we often emit "synchronization entry" for `InvocationEntryBci` at method entry/exit in the debug info, although there is no synchronization happening. I filed [JDK-8303451](https://bugs.openjdk.org/browse/JDK-8303451) to fix that. > > Thanks, > Tobias This pull request has now been integrated. 
Changeset: 94eda53d Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/94eda53d98e5011cc613d031ff8941e254eb666b Stats: 154 lines in 3 files changed: 152 ins; 2 del; 0 mod 8201516: DebugNonSafepoints generates incorrect information Reviewed-by: kvn, roland ------------- PR: https://git.openjdk.org/jdk/pull/12806 From chagedorn at openjdk.org Tue Mar 7 07:59:30 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 7 Mar 2023 07:59:30 GMT Subject: RFR: 8303627: compiler/loopopts/TestUnreachableInnerLoop.java failed with -XX:LoopMaxUnroll=4 In-Reply-To: References: Message-ID: <5Y1mi68ViwHZt3bHOUOdrrIo3_yjhZhCz7qpbERx0qc=.ccbc674a-c45f-4c07-aa0a-e9465dc789c8@github.com> On Mon, 6 Mar 2023 02:42:21 GMT, SUN Guoyun wrote: > This test failed with VM_OPTIONS=-XX:LoopMaxUnroll=4 and CONF=fastdebug on X86_64, AArch64 LoongArch64 architecture. > >

> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/sunguoyun/jdk-ls/src/hotspot/share/opto/block.cpp:1359), pid=31328, tid=31344
> # assert(n->is_Root() || n->is_Region() || n->is_Phi() || n->is_MachMerge() || def_block->dominates(block)) failed: uses must be dominated by definitions
> #
> 
> This PR fix the issue, Please help review it. > > Thanks. It indeed seems like a dup of [JDK-8291025](https://bugs.openjdk.org/browse/JDK-8291025) and eventually [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). But I'm currently out of the office until the end of the week, so I cannot verify it - will have a look again when I'm back to work. ------------- PR: https://git.openjdk.org/jdk/pull/12874 From thartmann at openjdk.org Tue Mar 7 08:05:12 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 7 Mar 2023 08:05:12 GMT Subject: RFR: 8303511: C2: assert(get_ctrl(n) == cle_out) during unrolling [v2] In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 13:50:40 GMT, Roland Westrelin wrote: >> In the same round of loop optimizations: >> >> - `PhaseIdealLoop::remix_address_expressions()` creates a new `AddP` >> out of loop. It sets it control to >> `n_loop->_head->in(LoopNode::EntryControl)` which, because the loop is >> strip mined, is an `OuterStripMinedLoop`. >> >> - The `LoadI` for that `AddP` is found to only have uses outside the >> loop and is cloned out of the loop. It's referenced by the outer >> loop's safepoint. >> >> - The loop is unrolled. Unrolling follows the safepoint's inputs and >> find the new `AddP` with control set to the `OuterStripMinedLoop` >> and the assert fires. >> >> No control should be set to an `OuterStripMinedLoop`. The fix is >> straightforward and sets the control to the `OuterStripMinedLoop` >> entry control. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/loopstripmining/TestAddPAtOuterLoopHead.java > > Co-authored-by: Andrey Turbanov Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12824 From amitkumar at openjdk.org Tue Mar 7 08:26:44 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Tue, 7 Mar 2023 08:26:44 GMT Subject: RFR: 8303497: [s390x] ProblemList TestUnreachableInnerLoop.java Message-ID: This PR adds TestUnreachableInnerLoop.java in ProblemList.txt which is failing on s390x and will be fixed by [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). ------------- Commit messages: - updates copyright header - adds TestUnreachableInnerLoop.java to ProblemList Changes: https://git.openjdk.org/jdk/pull/12833/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12833&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303497 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12833.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12833/head:pull/12833 PR: https://git.openjdk.org/jdk/pull/12833 From jsjolen at openjdk.org Tue Mar 7 08:34:59 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Tue, 7 Mar 2023 08:34:59 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v4] In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 16:14:00 GMT, Johan Sj?len wrote: >> Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we >> need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. >> >> Here are some typical things to look out for: >> >> No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). 
>> Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. >> nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. >> >> An example of this: >> >> >> // This function returns null >> void* ret_null(); >> // This function returns true if *x == nullptr >> bool is_nullptr(void** x); >> >> >> Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. >> >> Thanks! > > Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: > > Explicitly use 0 for null in ARM interpreter linux-x86 GHA test failed with: > Unrecognized VM option 'UseCompressedClassPointers' Unlikely to be a bug in this PR. ------------- PR: https://git.openjdk.org/jdk/pull/12187 From jsjolen at openjdk.org Tue Mar 7 08:39:49 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Tue, 7 Mar 2023 08:39:49 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v5] In-Reply-To: References: Message-ID: > Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we > need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. > > Here are some typical things to look out for: > > No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). > Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. > nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. > > An example of this: > > > // This function returns null > void* ret_null(); > // This function returns true if *x == nullptr > bool is_nullptr(void** x); > > > Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. > > Thanks! Johan Sj?len has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 10 commits: - Merge remote-tracking branch 'origin/JDK-8301074' into JDK-8301074 - Explicitly use 0 for null in ARM interpreter - Merge remote-tracking branch 'origin/master' into JDK-8301074 - Remove trailing whitespace - Check for null string explicitly - vkozlov fixes - Manual review fixes - Fix - Fix compile errors - Replace NULL with nullptr in share/opto/ ------------- Changes: https://git.openjdk.org/jdk/pull/12187/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12187&range=04 Stats: 5590 lines in 111 files changed: 1 ins; 0 del; 5589 mod Patch: https://git.openjdk.org/jdk/pull/12187.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12187/head:pull/12187 PR: https://git.openjdk.org/jdk/pull/12187 From thartmann at openjdk.org Tue Mar 7 08:40:55 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 7 Mar 2023 08:40:55 GMT Subject: RFR: 8303497: [s390x] ProblemList TestUnreachableInnerLoop.java In-Reply-To: References: Message-ID: <0VVp8qFodqhUrp11CZpr23fzngQ0y8K4Oo1Rb3M3Xqw=.7533134b-229e-4f37-9e33-e8a192a3cd21@github.com> On Thu, 2 Mar 2023 16:08:17 GMT, Amit Kumar wrote: > This PR adds TestUnreachableInnerLoop.java in ProblemList.txt which is failing on s390x and will be fixed by [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12833 From amitkumar at openjdk.org Tue Mar 7 08:40:55 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Tue, 7 Mar 2023 08:40:55 GMT Subject: RFR: 8303497: [s390x] ProblemList TestUnreachableInnerLoop.java In-Reply-To: <0VVp8qFodqhUrp11CZpr23fzngQ0y8K4Oo1Rb3M3Xqw=.7533134b-229e-4f37-9e33-e8a192a3cd21@github.com> References: <0VVp8qFodqhUrp11CZpr23fzngQ0y8K4Oo1Rb3M3Xqw=.7533134b-229e-4f37-9e33-e8a192a3cd21@github.com> Message-ID: On Tue, 7 Mar 2023 08:37:59 GMT, Tobias Hartmann wrote: >> This PR adds TestUnreachableInnerLoop.java in ProblemList.txt which is failing on s390x and will be fixed by [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). > > Looks good to me. Thanks @TobiHartmann for approval. ------------- PR: https://git.openjdk.org/jdk/pull/12833 From roland at openjdk.org Tue Mar 7 08:41:16 2023 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 7 Mar 2023 08:41:16 GMT Subject: RFR: 8303511: C2: assert(get_ctrl(n) == cle_out) during unrolling [v2] In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 17:33:10 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/loopstripmining/TestAddPAtOuterLoopHead.java >> >> Co-authored-by: Andrey Turbanov > > Yes, this looks good. @vnkozlov @TobiHartmann thanks for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/12824 From roland at openjdk.org Tue Mar 7 08:41:20 2023 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 7 Mar 2023 08:41:20 GMT Subject: Integrated: 8303511: C2: assert(get_ctrl(n) == cle_out) during unrolling In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 09:24:17 GMT, Roland Westrelin wrote: > In the same round of loop optimizations: > > - `PhaseIdealLoop::remix_address_expressions()` creates a new `AddP` > out of loop. It sets it control to > `n_loop->_head->in(LoopNode::EntryControl)` which, because the loop is > strip mined, is an `OuterStripMinedLoop`. 
> > - The `LoadI` for that `AddP` is found to only have uses outside the > loop and is cloned out of the loop. It's referenced by the outer > loop's safepoint. > > - The loop is unrolled. Unrolling follows the safepoint's inputs and > find the new `AddP` with control set to the `OuterStripMinedLoop` > and the assert fires. > > No control should be set to an `OuterStripMinedLoop`. The fix is > straightforward and sets the control to the `OuterStripMinedLoop` > entry control. This pull request has now been integrated. Changeset: 3f2d929d Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/3f2d929dc3336b301e7e5dceb899d59451645828 Stats: 84 lines in 2 files changed: 82 ins; 0 del; 2 mod 8303511: C2: assert(get_ctrl(n) == cle_out) during unrolling Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12824 From amitkumar at openjdk.org Tue Mar 7 09:00:43 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Tue, 7 Mar 2023 09:00:43 GMT Subject: Integrated: 8303497: [s390x] ProblemList TestUnreachableInnerLoop.java In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 16:08:17 GMT, Amit Kumar wrote: > This PR adds TestUnreachableInnerLoop.java in ProblemList.txt which is failing on s390x and will be fixed by [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). This pull request has now been integrated. Changeset: 52d30087 Author: Amit Kumar Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/52d30087734ad95761078793da6e207797558e2b Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8303497: [s390x] ProblemList TestUnreachableInnerLoop.java Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12833 From tholenstein at openjdk.org Tue Mar 7 09:20:04 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 7 Mar 2023 09:20:04 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v2] In-Reply-To: References: Message-ID: <3gzIf5m8QCIFtTbcIHsIrk_yL-2JY2gN8q3MuCbwjkw=.673b118d-e2ee-4d98-a61f-b548e784dbe6@github.com> > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Custom--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Custom profile > Each tab has a `--Custom--` filter profile which is selected when opening a graph. Filters applied to the `--Custom--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Custom--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Custom--` profile gets cloned as well. 
Further the clone has the same filter profile selected when it gets opened > cloneTab Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: - global with default filters - add global filter profile - save filter profiles ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/86e5153f..ff4d7850 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=00-01 Stats: 90 lines in 1 file changed: 87 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From dnsimon at openjdk.org Tue Mar 7 09:40:20 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 7 Mar 2023 09:40:20 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: References: <2glhKPYZQpwJioZGfuoEyrB_JwEkDnMKV6FbuzwoIk4=.1de88577-0676-4492-9b6f-bc81e36e68f6@github.com> Message-ID: On Fri, 20 Jan 2023 12:38:34 GMT, Doug Simon wrote: >> @dougxc I'm not sure why the above code is written the way it is rather than the way you rewrote it. I cannot see any reason why there should already be a trampoline stub in place when trampoline_jump is called given how it is being called at present. I thought perhaps it might be something to do with the (newly introduced) shared trampoline code but that is not relevant here and, besides, this routine has been thew way it is since it was first introduced. >> >> I know the trampoline (and related far jump) code has been subject to change over the years so it may be something to do with how this routine was called in an earlier incarnation of the code. Andrew Haley will have a better idea than me as he was the original author. >> >> Anyway, if we may need far branches and the call to is_NativeCallTrampolineStub_at fails then it does not seem tome to make any sense to call set_destination (at least null is returned which is correct). So, I think your rewrite looks like it is doing the right thing. >> >> I think you probably need an ok from Andrew Haley here though. > > Ok, thanks for your input. I'll wait for @theRealAph to review it as well. @theRealAph @adinn can I now merge this PR? ------------- PR: https://git.openjdk.org/jdk/pull/11945 From epeter at openjdk.org Tue Mar 7 09:52:30 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 7 Mar 2023 09:52:30 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v9] In-Reply-To: References: <8lyMxPlTmzLDuwFLvhla8AAL6sVxEjigf4LaDlsAVWg=.4bfb5428-4b11-4c65-8540-503cf9e595f1@github.com> <_eDa4Xjs3HIxm2u4bUOWmmZL6y2DjQGd9kVyrUCTSD0=.d56d4533-bf46-4900-8c7c-6e22d247ff22@github.com> <9bFwAtzXSc26dTNribicwtgQTa5GENR0Pwl0xUA1v2A=.2aa36d32-5976-4f75-a0d1-44227fc29168@github.com> Message-ID: On Tue, 7 Mar 2023 06:33:47 GMT, Jatin Bhateja wrote: >> I now removed all such negative IR rules. > > Hi @eme64 , > Few more suggestions, my original request was to write few hand crafted tests with combinations of AlignVector, Vectorize and MaxVectorSize options. All the test points in TestDependencyOffsets.java are mainly around two kernels, forward writes (RAW/true dependency) and forward reads (WAR/anti-dependence) but exhaustively generates multiple IR rules. 
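(As an aside, the two kernel shapes referred to here can be sketched in plain Java. The sketch below is hand-written for illustration, it is not copied from the generated TestDependencyOffsets.java, and the method names, array size and the multiplier 11 are arbitrary.)

```java
public class DependencyKernels {
    // Forward write: the store in iteration i feeds the load in iteration i + offset,
    // a loop-carried RAW (true) dependence that can limit vectorization for small offsets.
    static void forwardWrite(int[] data, int offset) {
        for (int i = 0; i < data.length - offset; i++) {
            data[i + offset] = data[i] * 11;
        }
    }

    // Forward read: iteration i reads an element that iteration i + offset will overwrite,
    // a WAR (anti) dependence that is usually easier for SuperWord to handle.
    static void forwardRead(int[] data, int offset) {
        for (int i = 0; i < data.length - offset; i++) {
            data[i] = data[i + offset] * 11;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[1000];
        int[] b = new int[1000];
        forwardWrite(a, 4);
        forwardRead(b, 4);
        System.out.println(a[500] + " " + b[500]);
    }
}
```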
> applyIfCPUFeature uses VM identified CPU feature checks which are constrained by UseSSE and UseAVX options. You may extend your script to generate curated scenarios like following to trigger more rules. > > > // CPU: sse4.1 to avx -> vector_width: 16 -> elements in vector: 4 > scenarios[0] = new Scenario(0, "-XX:-/+AlignVector", "-XX:MaxVectorSize=XXX", "-XX:UseAVX=0"); > // CPU: avx2 -> vector_width: 32 -> elements in vector: 8 > scenarios[1] = new Scenario(1, "-XX:-/+AlignVector", "-XX:MaxVectorSize=XXX", "-XX:UseAVX=2"); > // CPU: avx512 -> vector_width: 64 -> elements in vector: 16 > scenarios[2] = new Scenario(2, "-XX:-/+AlignVector", "-XX:MaxVectorSize=XXX", "-XX:UseAVX=3"); > > > Since there are mainly 4 categories of features. > > > CPROMPT>grep "applyIfCPUFeatureAnd" TestDependencyOffsets.java | sort -u | uniq > applyIfCPUFeatureAnd = {"avx2", "true", "avx512bw", "false"}) > applyIfCPUFeatureAnd = {"avx2", "true", "avx512", "false"}) > applyIfCPUFeatureAnd = {"avx", "true", "avx512", "false"}) > applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "false"}) > applyIfCPUFeatureAnd = {"sse4.1", "true", "avx", "false"}) > > > Also, this test file is script generated hence we may generate separate test files for AARCH64 and X86 to make the tests more maintainable. Hi @jatin-bhateja . At Oracle we run the `compiler/loopopts/superword` and `compiler/vectorization` tests with various `AVX` and `SSE` settings, including the `UseKNLSetting`. I could remove the exhaustive loop over all `MaxVectorSize` values, and instead run only a few per `AVX / SSE` setting. That may make it a bit more efficient, and easier to test everything at once on one machine. I can also make one file `x86` specific (require `x86/x64`) and the other file allows all other platforms (but only contains the `asimd` IR rules). I would probably have different scenarios for that second file, since ARM chips have various `MaxVectorSize`, depending on hardware. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From tholenstein at openjdk.org Tue Mar 7 10:15:49 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 7 Mar 2023 10:15:49 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v3] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Custom--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Custom profile > Each tab has a `--Custom--` filter profile which is selected when opening a graph. Filters applied to the `--Custom--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Custom--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Custom--` profile gets cloned as well. 
Further the clone has the same filter profile selected when it gets opened > cloneTab Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: make global the default filter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/ff4d7850..fbabcdaa Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From tholenstein at openjdk.org Tue Mar 7 10:24:28 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 7 Mar 2023 10:24:28 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 13:22:16 GMT, Roberto Casta?eda Lozano wrote: >> In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). >> >> - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile >> - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. >> >> ### Global profile >> Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. >> >> ### Local profile >> Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. >> tabA >> >> When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. >> tabB >> >> ### New profile >> The user can also create a new filter profile and give it a name. E.g. `My Filters` >> newProfile >> >> The `My Filters` profile is then globally available to other tabs as well >> selectProfile >> >> >> ### Filters for cloned tabs >> When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened >> cloneTab > > Thanks for working on this, Toby! Being able to apply different filters per tab is definitely useful feature. However, as an IGV user I miss two things from the current behavior: persistence (the same filters are applied after restarting IGV) and the ability to apply the same filter configuration to all tabs in a simple manner. > > I would like to propose an alternative model that is almost a superset of what is proposed here and would preserve persistence and easy filter synchronization among tabs. By default, each tab has two filter profiles available, "local" and "global". More profiles cannot be added or removed. The local filter profile can be edited but is not persistent (i.e. it acts like the `--Custom--` profile in this changeset). The global filter profile can be edited, is persistent, and the changes are propagated for all tabs where it is selected. 
The `Link node selection globally` button is generalized to `Link node and filter selection globally`. It is disabled by default, and clicking on it selects the global filter profile for all opened tabs. > > What do you think? This is just my input as user, it would be useful to see what others think here. I updated the PR. @robcasloz > However, as an IGV user I miss two things from the current behavior: persistence (the same filters are applied after restarting IGV) I agree with this. Now global filters profiles are saved and reloaded at startup > and the ability to apply the same filter configuration to all tabs in a simple manner. Before my PR all filter profiles were global. And they still are except for the `--Local--` profile. I now added also a `--Global--` profile that is selected by default. > I would like to propose an alternative model that is almost a superset of what is proposed here and would preserve persistence and easy filter synchronization among tabs. By default, each tab has two filter profiles available, ?local? and ?global?. I added that now. > More profiles cannot be added or removed. I would prefer to keep the option to define new profile (especially, now that they are saved and reloaded at startup) > The local filter profile can be edited but is not persistent (i.e. it acts like the --Custom-- profile in this changeset). That?s what we have now > The global filter profile can be edited, is persistent, and the changes are propagated for all tabs where it is selected. `--Global--` is like this > The Link node selection globally button is generalized to Link node and filter selection globally. It is disabled by default, and clicking on it selects the global filter profile for all opened tabs. I prefer to keep the option to have a Tab with local and a Tab with global filters AND be able to link the selection. ------------- PR: https://git.openjdk.org/jdk/pull/12714 From tholenstein at openjdk.org Tue Mar 7 10:24:32 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 7 Mar 2023 10:24:32 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v3] In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 10:15:49 GMT, Tobias Holenstein wrote: >> In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). >> >> - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile >> - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. >> >> ### Global profile >> Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. >> >> ### Local profile >> Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. >> tabA >> >> When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. >> tabB >> >> ### New profile >> The user can also create a new filter profile and give it a name. E.g. 
`My Filters` >> newProfile >> >> The `My Filters` profile is then globally available to other tabs as well >> selectProfile >> >> >> ### Filters for cloned tabs >> When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened >> cloneTab > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > make global the default filter (I renamed `--Custom--` to `--Local--`) ------------- PR: https://git.openjdk.org/jdk/pull/12714 From bkilambi at openjdk.org Tue Mar 7 11:05:03 2023 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 7 Mar 2023 11:05:03 GMT Subject: RFR: 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE Message-ID: The cast operation for VectorMask from wider type to narrow type returns incorrect result for trueCount() method invocation for the resultant mask with SVE (on some SVE machines toLong() also results in incorrect values). An example narrow operation which results in incorrect toLong() and trueCount() values is shown below for a 128-bit -> 64-bit conversion and this can be extended to other narrow operations where the source mask in bytes is either 4x or 8x the size of the result mask in bytes - public class TestMaskCast { static final boolean [] mask_arr = {true, true, false, true}; public static long narrow_long() { VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); return lmask128.cast(IntVector.SPECIES_64).toLong(); } public static void main(String[] args) { long r = 0L; for (int ic = 0; ic < 50000; ic++) { r = narrow_long(); } System.out.println("toLong() : " + r); } } **C2 compilation result :** java --add-modules jdk.incubator.vector TestMaskCast toLong(): 15 **Interpreter result (for verification) :** java --add-modules jdk.incubator.vector -Xint TestMaskCast toLong(): 3 The incorrect results with toLong() have been observed only on the 128-bit and 256-bit SVE machines but they are not reproducible on a 512-bit machine. However, trueCount() returns incorrect values too and they are reproducible on all the SVE machines and thus is more reliable to use trueCount() to bring out the drawbacks of the current implementation of mask cast narrow operation for SVE. Replacing the call to toLong() by trueCount() in the above example - public class TestMaskCast { static final boolean [] mask_arr = {true, true, false, true}; public static int narrow_long() { VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); return lmask128.cast(IntVector.SPECIES_64).trueCount(); } public static void main(String[] args) { int r = 0; for (int ic = 0; ic < 50000; ic++) { r = narrow_long(); } System.out.println("trueCount() : " + r); } } **C2 compilation result:** java --add-modules jdk.incubator.vector TestMaskCast trueCount() : 4 **Interpreter result:** java --add-modules jdk.incubator.vector -Xint TestMaskCast trueCount() : 2 Since in this example, the source mask size in bytes is 2x that of the result mask, trueCount() returns 2x the number of true elements in the source mask. It would return 4x/8x the number of true elements in the source mask if the size of the source mask is 4x/8x that of result mask. 
The returned values are incorrect because of the higher order bits in the result not being cleared (since the result is narrowed down) and trueCount() or toLong() tend to consider the higher order bits in the vector register as well which results in incorrect value. For the 128-bit to 64-bit conversion with a mask - "TT" passed, the current implementation for mask cast narrow operation returns the same mask in the lower and upper half of the 128-bit register that is - "TTTT" which results in a long value of 15 (instead of 3 - "FFTT" for the 64-bit Integer mask) and number of true elements to be 4 (instead of 2). This patch proposes a fix for this problem. An already existing JTREG IR test - "test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java" has also been modified to call the trueCount() method as well since the toString() method alone cannot be used to reproduce the incorrect values in this bug. This test passes successfully on 128-bit, 256-bit and 512-bit SVE machines. Since the IR test has been changed, it has been tested successfully on other platforms like x86 and aarch64 Neon machines as well to ensure the changes have not introduced any new errors. ------------- Commit messages: - 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE Changes: https://git.openjdk.org/jdk/pull/12901/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12901&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303161 Stats: 589 lines in 5 files changed: 449 ins; 0 del; 140 mod Patch: https://git.openjdk.org/jdk/pull/12901.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12901/head:pull/12901 PR: https://git.openjdk.org/jdk/pull/12901 From jbhateja at openjdk.org Tue Mar 7 11:06:25 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 7 Mar 2023 11:06:25 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v9] In-Reply-To: References: <8lyMxPlTmzLDuwFLvhla8AAL6sVxEjigf4LaDlsAVWg=.4bfb5428-4b11-4c65-8540-503cf9e595f1@github.com> <_eDa4Xjs3HIxm2u4bUOWmmZL6y2DjQGd9kVyrUCTSD0=.d56d4533-bf46-4900-8c7c-6e22d247ff22@github.com> <9bFwAtzXSc26dTNribicwtgQTa5GENR0Pwl0xUA1v2A=.2aa36d32-5976-4f75-a0d1-44227fc29168@github.com> Message-ID: On Tue, 7 Mar 2023 09:48:46 GMT, Emanuel Peter wrote: >> Hi @eme64 , >> Few more suggestions, my original request was to write few hand crafted tests with combinations of AlignVector, Vectorize and MaxVectorSize options. All the test points in TestDependencyOffsets.java are mainly around two kernels, forward writes (RAW/true dependency) and forward reads (WAR/anti-dependence) but exhaustively generates multiple IR rules. >> applyIfCPUFeature uses VM identified CPU feature checks which are constrained by UseSSE and UseAVX options. You may extend your script to generate curated scenarios like following to trigger more rules. >> >> >> // CPU: sse4.1 to avx -> vector_width: 16 -> elements in vector: 4 >> scenarios[0] = new Scenario(0, "-XX:-/+AlignVector", "-XX:MaxVectorSize=XXX", "-XX:UseAVX=0"); >> // CPU: avx2 -> vector_width: 32 -> elements in vector: 8 >> scenarios[1] = new Scenario(1, "-XX:-/+AlignVector", "-XX:MaxVectorSize=XXX", "-XX:UseAVX=2"); >> // CPU: avx512 -> vector_width: 64 -> elements in vector: 16 >> scenarios[2] = new Scenario(2, "-XX:-/+AlignVector", "-XX:MaxVectorSize=XXX", "-XX:UseAVX=3"); >> >> >> Since there are mainly 4 categories of features. 
>> >> >> CPROMPT>grep "applyIfCPUFeatureAnd" TestDependencyOffsets.java | sort -u | uniq >> applyIfCPUFeatureAnd = {"avx2", "true", "avx512bw", "false"}) >> applyIfCPUFeatureAnd = {"avx2", "true", "avx512", "false"}) >> applyIfCPUFeatureAnd = {"avx", "true", "avx512", "false"}) >> applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "false"}) >> applyIfCPUFeatureAnd = {"sse4.1", "true", "avx", "false"}) >> >> >> Also, this test file is script generated hence we may generate separate test files for AARCH64 and X86 to make the tests more maintainable. > > Hi @jatin-bhateja . > At Oracle we run the `compiler/loopopts/superword` and `compiler/vectorization` tests with various `AVX` and `SSE` settings, including the `UseKNLSetting`. > > Actually, I'm against splitting the test for different platforms. Because we have a few tests that now only get executed on one platform, and the features might be rotting on other platforms without us noticing it. Also: I don't like code duplication. If someone wants to add a test, then it has to be added in multiple files. Not great. > > My suggestion: Instead of the Scenarios, I can create multiple jtreg-test statements. Advantages: > > 1. The can run in parallel. > 2. I can require platform features. > 3. I can have different jvm-flags for different runs. > > That way I can make some jtreg-test statements for the `AVX / SSE` platforms (and have different `UseAVX` settings). And other jtreg-test statements for other platforms (eg. aarch64 `asimd == Neon`). But I will keep the IR rules for all platforms at the specific `@Test`. Agree, it will be better if we can run multiple IR rules on one target using UseSSE/UseAVX flags rather than having tight target dependencies. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From rcastanedalo at openjdk.org Tue Mar 7 12:39:56 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 7 Mar 2023 12:39:56 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 10:19:52 GMT, Tobias Holenstein wrote: > I now added also a --Global-- profile that is selected by default. Thanks for the changes, Toby. I can see the `--Global--` profile selected by default, however as soon as I open a graph it switches to `--Local--`. Is this intended? ------------- PR: https://git.openjdk.org/jdk/pull/12714 From tholenstein at openjdk.org Tue Mar 7 13:43:12 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 7 Mar 2023 13:43:12 GMT Subject: RFR: 8300821: UB: Applying non-zero offset to non-null pointer 0xfffffffffffffffe produced null pointer Message-ID: <3li-TpWHA4E_MnXnSXCI6AXju4o6rBSw0Ey2J6fXOKM=.f05f51f5-3add-4b11-bc34-efeae0e39b7c@github.com> "UndefinedBehaviorSanitizer" (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html) in Xcode running on `java --version` discovered an Undefined Behavior. The reason is in the `next()` method https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/asm/codeBuffer.cpp#L798 In ``RelocIterator::next()`` we get a nullpointer after `_current++` https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/code/relocInfo.hpp#L612 But this is actually expected: In the constructor of the iterator `RelocIterator::RelocIterator` we have ```c++ _current = cs->locs_start()-1; _end = cs->locs_end(); and in our case locs_start() and locs_end() are `null` - so `_current` is `null`-1. 
After `_current++` both `_end` and `_current` are `null`. Just after `_current++` we then check if `_current == _end` and return `false` (there is no next reloc info) ## Solution We want to be able to turn on "UndefinedBehaviorSanitizer" and don't have false positives. So we add a check `cs->has_locs()` and only create the iterator if we have reloc info. Also added a sanity check in `RelocIterator::RelocIterator` that checks that either both `_current` and `_end` are null or both are not null. ------------- Commit messages: - UB: Applying non-zero offset to non-null pointer 0xfffffffffffffffe produced null pointer Changes: https://git.openjdk.org/jdk/pull/12854/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12854&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8300821 Stats: 4 lines in 2 files changed: 1 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/12854.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12854/head:pull/12854 PR: https://git.openjdk.org/jdk/pull/12854 From adinn at openjdk.org Tue Mar 7 15:30:26 2023 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 7 Mar 2023 15:30:26 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: References: <2glhKPYZQpwJioZGfuoEyrB_JwEkDnMKV6FbuzwoIk4=.1de88577-0676-4492-9b6f-bc81e36e68f6@github.com> Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Tue, 7 Mar 2023 09:37:35 GMT, Doug Simon wrote: >> Ok, thanks for your input. I'll wait for @theRealAph to review it as well. > > @theRealAph @adinn can I now merge this PR? @dougxc Still ok with me. I just pinged Andrew Haley to see if he is ok with it. ------------- PR: https://git.openjdk.org/jdk/pull/11945 From sviswanathan at openjdk.org Tue Mar 7 17:02:23 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 7 Mar 2023 17:02:23 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v2] In-Reply-To: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> References: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> Message-ID: On Tue, 7 Mar 2023 02:53:48 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. 
>> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments The PR looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR: https://git.openjdk.org/jdk/pull/12869 From never at openjdk.org Tue Mar 7 17:32:13 2023 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 7 Mar 2023 17:32:13 GMT Subject: RFR: 8302452: [JVMCI] Export _poly1305_processBlocks, JfrThreadLocal fields to JVMCI compiler. In-Reply-To: References: Message-ID: On Tue, 14 Feb 2023 15:13:05 GMT, Yudi Zheng wrote: > This PR allows JVMCI compiler intrinsics to reuse the _poly1305_processBlocks stub and to update JfrThreadLocal fields on `Thread.setCurrentThread` events. Marked as reviewed by never (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/12560 From never at openjdk.org Tue Mar 7 17:33:08 2023 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 7 Mar 2023 17:33:08 GMT Subject: RFR: JDK-8303678: [JVMCI] Add possibility to convert object JavaConstant to jobject. In-Reply-To: <7CCfSjqdge_fL8Ev_oY44xARp28LpOIOwZQjTks8Igg=.61bcfaa2-23fe-4dbd-965b-39b77ebdec5e@github.com> References: <7CCfSjqdge_fL8Ev_oY44xARp28LpOIOwZQjTks8Igg=.61bcfaa2-23fe-4dbd-965b-39b77ebdec5e@github.com> Message-ID: On Mon, 6 Mar 2023 15:25:36 GMT, Tom?? Zezula wrote: > This pull request adds a `jdk.vm.ci.hotspot.HotSpotJVMCIRuntime#getJObjectValue(HotSpotObjectConstant peerObject)` method, which gets a reference to an object in the peer runtime wrapped by the `jdk.vm.ci.hotspot.IndirectHotSpotObjectConstantImpl`. The reference is returned as a HotSpot heap JNI jobject. Marked as reviewed by never (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/12882 From jbhateja at openjdk.org Tue Mar 7 18:33:11 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 7 Mar 2023 18:33:11 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 03:00:34 GMT, Vladimir Kozlov wrote: >> Other than the minor comments above, the x86 side changes look good to me. > > @sviswa7 I update changes based on your comments. Please, look: [9302d4b](https://github.com/openjdk/jdk/pull/12869/commits/9302d4bc00f8f1d8e774a260eb6aacb2d51a2dd4) Hi @vnkozlov , There is some discrepancy in results b/w interpreter, C1 and C2 for following case. public class Foo { public static short bar(float f) {return Float.floatToFloat16(f);} public static void main(String[] args) { System.out.println(Float.floatToRawIntBits(Float.float16ToFloat((short) 31745))); System.out.println(bar(Float.float16ToFloat((short) 31745))); } } CPROMPT>java -Xint -cp . Foo 2143297536 // FP32 QNaN + significand preserved 32257 // FP16 QNaN + significand preserved CPROMPT>java -Xbatch -Xcomp -cp . Foo 2139103232 // FP32 SNaN + significand preserved 31745 // FP16 SNaN + significand preserved CPROMPT>java -XX:-TieredCompilation -Xbatch -Xcomp -cp . Foo 2139103232 // FP32 SNaN + significand preserved 32257 // FP16 QNaN + significand preserved. 
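For readers decoding the constants in this example, the bit patterns work out as follows (worked out by hand from the IEEE 754 binary32/binary16 layouts, not taken from the compiler output):

2143297536 = 0x7FC02000 : FP32 NaN with the quiet bit (0x00400000) set, payload 0x2000
2139103232 = 0x7F802000 : FP32 NaN with the quiet bit clear (signaling), payload 0x2000
32257 = 0x7E01 : FP16 NaN with the quiet bit (0x0200) set, payload 0x01
31745 = 0x7C01 : FP16 NaN with the quiet bit clear (signaling), payload 0x01

The FP16 payload 0x01 corresponds to the FP32 payload 0x2000 because the 10-bit half-precision significand is shifted left by 13 bits when widening to single precision, which is what the "significand preserved" comments refer to.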
------------- PR: https://git.openjdk.org/jdk/pull/12869 From qamai at openjdk.org Tue Mar 7 18:34:01 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 7 Mar 2023 18:34:01 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice Message-ID: `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. Please take a look and have some reviews. Thank you very much. ------------- Commit messages: - sse2, increase warmup - aesthetic - optimise 64B - add jmh - vector slice intrinsics Changes: https://git.openjdk.org/jdk/pull/12909/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303762 Stats: 1699 lines in 58 files changed: 1376 ins; 257 del; 66 mod Patch: https://git.openjdk.org/jdk/pull/12909.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12909/head:pull/12909 PR: https://git.openjdk.org/jdk/pull/12909 From qamai at openjdk.org Tue Mar 7 18:34:01 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 7 Mar 2023 18:34:01 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice In-Reply-To: References: Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Tue, 7 Mar 2023 18:23:42 GMT, Quan Anh Mai wrote: > `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. > > A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. > > Please take a look and have some reviews. 
Thank you very much. Benchmark results: Before After Benchmark (size) Mode Cnt Score Error Score Error Units Change Byte128Vector.sliceBinaryConstant 1024 thrpt 5 5058.760 ? 2214.115 8315.263 ? 102.169 ops/ms +64.37% Byte256Vector.sliceBinaryConstant 1024 thrpt 5 6986.299 ? 1028.257 8440.387 ? 30.163 ops/ms +20.81% Byte64Vector.sliceBinaryConstant 1024 thrpt 5 2944.869 ? 849.548 5926.054 ? 493.146 ops/ms +101.23% ByteMaxVector.sliceBinaryConstant 1024 thrpt 5 7269.226 ? 366.246 8201.184 ? 309.539 ops/ms +12.82% Double128Vector.sliceBinaryConstant 1024 thrpt 5 10.204 ? 0.508 979.287 ? 19.991 ops/ms x95.97 Double256Vector.sliceBinaryConstant 1024 thrpt 5 868.085 ? 26.378 967.799 ? 10.224 ops/ms +11.49% DoubleMaxVector.sliceBinaryConstant 1024 thrpt 5 813.646 ? 74.468 978.150 ? 14.316 ops/ms +20.22% Float128Vector.sliceBinaryConstant 1024 thrpt 5 1297.281 ? 23.650 1850.995 ? 29.741 ops/ms +42.68% Float256Vector.sliceBinaryConstant 1024 thrpt 5 1796.121 ? 26.662 2011.362 ? 38.418 ops/ms +11.98% Float64Vector.sliceBinaryConstant 1024 thrpt 5 10.381 ? 0.194 1628.510 ? 8.752 ops/ms x156.87 FloatMaxVector.sliceBinaryConstant 1024 thrpt 5 1820.161 ? 26.802 1988.085 ? 41.835 ops/ms +9.23% Int128Vector.sliceBinaryConstant 1024 thrpt 5 1394.911 ? 40.815 1864.818 ? 33.792 ops/ms +33.69% Int256Vector.sliceBinaryConstant 1024 thrpt 5 1874.496 ? 60.541 1864.818 ? 33.792 ops/ms -0.52% Int64Vector.sliceBinaryConstant 1024 thrpt 5 10.942 ? 0.377 1621.849 ? 56.538 ops/ms x148.22 IntMaxVector.sliceBinaryConstant 1024 thrpt 5 1870.746 ? 40.665 2027.041 ? 25.880 ops/ms +8.35% Long128Vector.sliceBinaryConstant 1024 thrpt 5 10.595 ? 0.306 991.969 ? 15.033 ops/ms x93.63 Long256Vector.sliceBinaryConstant 1024 thrpt 5 815.689 ? 12.243 989.365 ? 25.969 ops/ms +21.29% LongMaxVector.sliceBinaryConstant 1024 thrpt 5 822.060 ? 12.337 977.061 ? 31.968 ops/ms +18.86% Short128Vector.sliceBinaryConstant 1024 thrpt 5 3062.676 ? 124.796 3890.796 ? 326.767 ops/ms +27.04% Short256Vector.sliceBinaryConstant 1024 thrpt 5 3747.778 ? 119.356 4125.463 ? 33.602 ops/ms +10.08% Short64Vector.sliceBinaryConstant 1024 thrpt 5 1879.203 ? 69.160 2899.515 ? 57.870 ops/ms +54.29% ShortMaxVector.sliceBinaryConstant 1024 thrpt 5 3717.217 ? 48.876 4035.455 ? 102.725 ops/ms +8.56% ------------- PR: https://git.openjdk.org/jdk/pull/12909 From kvn at openjdk.org Tue Mar 7 18:41:37 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 18:41:37 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 03:00:34 GMT, Vladimir Kozlov wrote: >> Other than the minor comments above, the x86 side changes look good to me. > > @sviswa7 I update changes based on your comments. Please, look: [9302d4b](https://github.com/openjdk/jdk/pull/12869/commits/9302d4bc00f8f1d8e774a260eb6aacb2d51a2dd4) > Hi @vnkozlov , There is some discrepancy in results b/w interpreter, C1 and C2 for following case. And that is fine. Consistency have to be preserved only during one run. Different runs with different flags (with disabled intrinsics, for example) may produce different results. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From yzheng at openjdk.org Tue Mar 7 18:47:56 2023 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 7 Mar 2023 18:47:56 GMT Subject: Integrated: 8302452: [JVMCI] Export _poly1305_processBlocks, JfrThreadLocal fields to JVMCI compiler. 
In-Reply-To: References: Message-ID: On Tue, 14 Feb 2023 15:13:05 GMT, Yudi Zheng wrote: > This PR allows JVMCI compiler intrinsics to reuse the _poly1305_processBlocks stub and to update JfrThreadLocal fields on `Thread.setCurrentThread` events. This pull request has now been integrated. Changeset: 4d4eadea Author: Yudi Zheng Committer: Doug Simon URL: https://git.openjdk.org/jdk/commit/4d4eadeae320722191feaf8022a04461232ae95b Stats: 12 lines in 3 files changed: 12 ins; 0 del; 0 mod 8302452: [JVMCI] Export _poly1305_processBlocks, JfrThreadLocal fields to JVMCI compiler. Reviewed-by: dnsimon, never ------------- PR: https://git.openjdk.org/jdk/pull/12560 From jwilhelm at openjdk.org Tue Mar 7 18:48:20 2023 From: jwilhelm at openjdk.org (Jesper Wilhelmsson) Date: Tue, 7 Mar 2023 18:48:20 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v5] In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 08:39:49 GMT, Johan Sj?len wrote: >> Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we >> need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. >> >> Here are some typical things to look out for: >> >> No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). >> Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. >> nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. >> >> An example of this: >> >> >> // This function returns null >> void* ret_null(); >> // This function returns true if *x == nullptr >> bool is_nullptr(void** x); >> >> >> Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. >> >> Thanks! > > Johan Sj?len has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - Merge remote-tracking branch 'origin/JDK-8301074' into JDK-8301074 > - Explicitly use 0 for null in ARM interpreter > - Merge remote-tracking branch 'origin/master' into JDK-8301074 > - Remove trailing whitespace > - Check for null string explicitly > - vkozlov fixes > - Manual review fixes > - Fix > - Fix compile errors > - Replace NULL with nullptr in share/opto/ Looks good in general. There are a bunch of places where the code previously was intentionally aligned (comments, braces, instructions etc) where it's now not aligned anymore since nullptr has more characters than NULL. I have put notes on the places I noticed. There are several strings that are modified, some seems to be part of logging and other text that is observable by a user. I'm not sure what our policy is around keeping logging / hs_err.log etc stable so just want to bring it up so that it's not a surprise to anyone. src/hotspot/share/opto/callGenerator.hpp line 48: > 46: > 47: virtual bool do_late_inline_check(Compile* C, JVMState* jvms) { ShouldNotReachHere(); return false; } > 48: virtual CallGenerator* inline_cg() const { ShouldNotReachHere(); return nullptr; } 'NULL' was shorter than 'false' so it had two spaces before the }. That's not the case anymore so remove a space? 
src/hotspot/share/opto/callnode.cpp line 837: > 835: const TypeInstPtr* inst_t = phase->type(proj)->isa_instptr(); > 836: if ((inst_t != nullptr) && (!inst_t->klass_is_exact() || > 837: (inst_t->instance_klass() == boxing_klass))) { Indentation of this line should increase as the starting parenthesis above has moved. src/hotspot/share/opto/cfgnode.cpp line 627: > 625: > 626: if( cnt <= 1 ) { // Only 1 path in? > 627: set_req(0, nullptr); // Null control input for region copy Align comments? src/hotspot/share/opto/cfgnode.cpp line 1778: > 1776: return nullptr; // Bail out on funny non-value stuff > 1777: if( phi->req() <= 3 ) // Need at least 2 matched inputs and a > 1778: return nullptr; // third unequal input to be worth doing Align comments. src/hotspot/share/opto/chaitin.cpp line 382: > 380: { > 381: Compile::TracePhase tp("computeLive", &timers[_t_computeLive]); > 382: _live = nullptr; // Mark live as being not available Align comment src/hotspot/share/opto/compile.cpp line 301: > 299: > 300: // Initialize worklist > 301: if (root() != nullptr) { useful.push(root()); } Looks like the braces was aligned before. src/hotspot/share/opto/compile.cpp line 1617: > 1615: > 1616: // Handle special cases. > 1617: if (adr_type == nullptr) return alias_type(AliasIdxTop); Align returns. src/hotspot/share/opto/compile.cpp line 1760: > 1758: bool Compile::must_alias(const TypePtr* adr_type, int alias_idx) { > 1759: if (alias_idx == AliasIdxBot) return true; // the universal category > 1760: if (adr_type == nullptr) return true; // null serves as TypePtr::TOP Align return. src/hotspot/share/opto/compile.cpp line 1778: > 1776: bool Compile::can_alias(const TypePtr* adr_type, int alias_idx) { > 1777: if (alias_idx == AliasIdxTop) return false; // the empty category > 1778: if (adr_type == nullptr) return false; // null serves as TypePtr::TOP Align returns. src/hotspot/share/opto/divnode.cpp line 467: > 465: const Type *t = phase->type( in(2) ); > 466: if( t == TypeInt::ONE ) // Identity? > 467: return nullptr; // Skip it Align comment. src/hotspot/share/opto/divnode.cpp line 482: > 480: jint i = ti->get_con(); // Get divisor > 481: > 482: if (i == 0) return nullptr; // Dividing by zero constant does not idealize Align comment? With line 480. src/hotspot/share/opto/divnode.cpp line 573: > 571: const Type *t = phase->type( in(2) ); > 572: if( t == TypeLong::ONE ) // Identity? > 573: return nullptr; // Skip it Align comment. src/hotspot/share/opto/divnode.cpp line 581: > 579: // Check for excluding div-zero case > 580: if (in(0) && (tl->_hi < 0 || tl->_lo > 0)) { > 581: set_req(0, nullptr); // Yank control input Align comment. src/hotspot/share/opto/divnode.cpp line 588: > 586: jlong l = tl->get_con(); // Get divisor > 587: > 588: if (l == 0) return nullptr; // Dividing by zero constant does not idealize Align comment. src/hotspot/share/opto/divnode.cpp line 724: > 722: const Type *t2 = phase->type( in(2) ); > 723: if( t2 == TypeF::ONE ) // Identity? > 724: return nullptr; // Skip it Align comment. src/hotspot/share/opto/divnode.cpp line 816: > 814: const Type *t2 = phase->type( in(2) ); > 815: if( t2 == TypeD::ONE ) // Identity? > 816: return nullptr; // Skip it Align comment. src/hotspot/share/opto/gcm.cpp line 283: > 281: static Block* find_deepest_input(Node* n, const PhaseCFG* cfg) { > 282: // Find the last input dominated by all other inputs. > 283: Block* deepb = nullptr; // Deepest block so far Align comment. 
src/hotspot/share/opto/gcm.cpp line 431: > 429: static Block* raise_LCA_above_use(Block* LCA, Node* use, Node* def, const PhaseCFG* cfg) { > 430: Block* buse = cfg->get_block_for_node(use); > 431: if (buse == nullptr) return LCA; // Unused killing Projs have no use block Align return. src/hotspot/share/opto/graphKit.cpp line 178: > 176: // Tell if _map is null, or control is top. > 177: bool GraphKit::stopped() { > 178: if (map() == nullptr) return true; Align return. src/hotspot/share/opto/library_call.cpp line 2417: > 2415: (mismatched || > 2416: heap_base_oop == top() || // - heap_base_oop is null or > 2417: (can_access_non_heap && field == nullptr)) // - heap_base_oop is potentially null Align comment. src/hotspot/share/opto/loopnode.cpp line 470: > 468: if (!stride->is_Con()) { // Oops, swap these > 469: if (!xphi->is_Con()) { // Is the other guy a constant? > 470: return nullptr; // Nope, unknown stride, bail out Align comment. src/hotspot/share/opto/memnode.cpp line 271: > 269: st->print("alias_idx==%d, adr_check==", alias_idx); > 270: if( adr_check == nullptr ) { > 271: st->print("null"); Where are these strings printed? Is this a user detectable change? src/hotspot/share/opto/memnode.cpp line 3697: > 3695: Node* base = AddPNode::Ideal_base_and_offset(st->in(MemNode::Address), > 3696: phase, offset); > 3697: if (base == nullptr) return -1; // something is dead, Align returns. src/hotspot/share/opto/memnode.cpp line 3720: > 3718: > 3719: Node* n = worklist.at(j); > 3720: if (n == nullptr) continue; // (can this really happen?) Align continue. src/hotspot/share/opto/memnode.cpp line 3944: > 3942: int i = captured_store_insertion_point(start, size_in_bytes, phase); > 3943: if (i == 0) { > 3944: return nullptr; // something is dead Align comment. src/hotspot/share/opto/memnode.cpp line 3995: > 3993: int i = captured_store_insertion_point(start, size_in_bytes, phase); > 3994: if (i == 0) return nullptr; // bail out > 3995: Node* prev_mem = nullptr; // raw memory for the captured store Align comments. src/hotspot/share/opto/memnode.cpp line 4166: > 4164: intcon[0] = 0; // undo store_constant() > 4165: set_req(i-1, st); // undo set_req(i, zmem) > 4166: nodes[j] = nullptr; // undo nodes[j] = st Align comment. src/hotspot/share/opto/multnode.cpp line 158: > 156: void ProjNode::check_con() const { > 157: Node* n = in(0); > 158: if (n == nullptr) return; // should be assert, but NodeHash makes bogons Align return. src/hotspot/share/opto/node.cpp line 1050: > 1048: i++; > 1049: } > 1050: _in[i] = n; // Stuff prec edge over null Align comment. src/hotspot/share/opto/node.hpp line 529: > 527: } > 528: _in[gap] = last; // Move last slot to empty one. > 529: _in[i] = nullptr; // null out last slot. Align comment. src/hotspot/share/opto/output.cpp line 2158: > 2156: #ifndef PRODUCT > 2157: if (_cfg->C->trace_opto_output()) > 2158: tty->print("# ChooseNodeToBundle: null\n"); User observable change(?) Do we care? src/hotspot/share/opto/phaseX.cpp line 2292: > 2290: } else if( n->is_Region() ) { // Unreachable region > 2291: // Note: nn == C->top() > 2292: n->set_req(0, nullptr); // Cut selfreference Align comment. ------------- Marked as reviewed by jwilhelm (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12187 From jwilhelm at openjdk.org Tue Mar 7 18:48:22 2023 From: jwilhelm at openjdk.org (Jesper Wilhelmsson) Date: Tue, 7 Mar 2023 18:48:22 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v5] In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 15:11:04 GMT, Jesper Wilhelmsson wrote: >> Johan Sj?len has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: >> >> - Merge remote-tracking branch 'origin/JDK-8301074' into JDK-8301074 >> - Explicitly use 0 for null in ARM interpreter >> - Merge remote-tracking branch 'origin/master' into JDK-8301074 >> - Remove trailing whitespace >> - Check for null string explicitly >> - vkozlov fixes >> - Manual review fixes >> - Fix >> - Fix compile errors >> - Replace NULL with nullptr in share/opto/ > > src/hotspot/share/opto/compile.cpp line 301: > >> 299: >> 300: // Initialize worklist >> 301: if (root() != nullptr) { useful.push(root()); } > > Looks like the braces was aligned before. (With line 303.) ------------- PR: https://git.openjdk.org/jdk/pull/12187 From kvn at openjdk.org Tue Mar 7 18:48:53 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 18:48:53 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v2] In-Reply-To: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> References: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> Message-ID: On Tue, 7 Mar 2023 02:53:48 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments On other hand `-Xint` should not disable intrinsics. I will look. 
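A quick way to see which engine actually produced the conversion results in such a reproducer is to watch compilation events, for example with `java -Xcomp -XX:-TieredCompilation -XX:+PrintCompilation Foo` (illustrative command): if `java.lang.Float::floatToFloat16` or `java.lang.Float::float16ToFloat` shows up as a compiled method, the value came from JIT-compiled code rather than from the interpreter's intrinsified method entry.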
------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 19:01:39 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 19:01:39 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v2] In-Reply-To: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> References: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> Message-ID: <6vpmSEQIHUeuayoEWhm124zzQY2CgwxG-hydLzjm4Z4=.47ad041b-1c49-4c06-ba9d-66f64f5fb7a1@github.com> On Tue, 7 Mar 2023 02:53:48 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments It looks like C1 compilation does not invoke intrinsics. Investigating. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 20:28:25 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 20:28:25 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v2] In-Reply-To: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> References: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> Message-ID: On Tue, 7 Mar 2023 02:53:48 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. 
>> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments We should not allow JIT compilation of `Float.float16ToFloat` and `Float.floatToFloat16` Java methods as we do for other math methods: [abstractInterpreter.hpp#L144](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/interpreter/abstractInterpreter.hpp#L144) What happens with @jatin-bhateja test is `Float` class is still not loaded when we trigger compilation for `Foo::main` method with `-Xcomp` flag. C2 generates Uncommon trap at the very start of `main()` and code is deoptimized and goes to Interpreter which uses intrinsics. C1 on other hand generates calls to `Float.float16ToFloat` and `Float.floatToFloat16` in compiled `Foo::main` method and then compiles `Float.float16ToFloat` and `Float.floatToFloat16` which are called. The fix ix simple: +++ b/src/hotspot/share/interpreter/abstractInterpreter.hpp @@ -159,6 +159,8 @@ class AbstractInterpreter: AllStatic { case vmIntrinsics::_dexp : // fall thru case vmIntrinsics::_fmaD : // fall thru case vmIntrinsics::_fmaF : // fall thru + case vmIntrinsics::_floatToFloat16 : // fall thru + case vmIntrinsics::_float16ToFloat : // fall thru case vmIntrinsics::_Continuation_doYield : // fall thru return false; test now produce the same result: $ java -Xint Foo 2143297536 32257 $ java -Xcomp -XX:-TieredCompilation Foo 2143297536 32257 $ java -Xcomp -XX:TieredStopAtLevel=1 Foo 2143297536 32257 ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 20:35:04 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 20:35:04 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v2] In-Reply-To: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> References: <9xh4y5OLD7_ZgSgnRtBk8dUWEODYa9ut3H19y3XNGxw=.d21814f5-ac8f-4786-98ff-53b926f0ad8e@github.com> Message-ID: <_z_y5VwjSFljVZjnn-lM7X6ecLtPpkSSSfKEiFjDCA4=.2d84e402-c22c-40e3-9a05-40977fd2ed63@github.com> On Tue, 7 Mar 2023 02:53:48 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments C2 also compiled `float16ToFloat`. That is why we got `FP32 SNaN` result with `-XX:-TieredCompilation` in Jatin's example. 
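For context, the hunk above lands in the `AbstractInterpreter::can_be_compiled(const methodHandle&)` guard, which the compilation policy consults so that such methods are left to the interpreter's intrinsified entries. Roughly (an abbreviated sketch reconstructed around the diff, not the verbatim header):

```c++
// share/interpreter/abstractInterpreter.hpp (sketch)
static bool can_be_compiled(const methodHandle& m) {
  switch (m->intrinsic_id()) {
    // ... other math intrinsics with dedicated interpreter entries ...
    case vmIntrinsics::_fmaD                 : // fall thru
    case vmIntrinsics::_fmaF                 : // fall thru
    case vmIntrinsics::_floatToFloat16       : // fall thru (added by the fix)
    case vmIntrinsics::_float16ToFloat       : // fall thru (added by the fix)
    case vmIntrinsics::_Continuation_doYield : // fall thru
      return false;   // never JIT-compile these, keep the interpreter intrinsic
    default:
      return true;
  }
}
```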
------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 21:38:38 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 21:38:38 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v3] In-Reply-To: References: Message-ID: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Do not allow JIT compilation of Float.float16ToFloat and Float.floatToFloat16 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12869/files - new: https://git.openjdk.org/jdk/pull/12869/files/9302d4bc..ed01863d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12869&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12869&range=01-02 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12869.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12869/head:pull/12869 PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Tue Mar 7 21:38:40 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 7 Mar 2023 21:38:40 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 18:28:46 GMT, Jatin Bhateja wrote: >> @sviswa7 I update changes based on your comments. Please, look: [9302d4b](https://github.com/openjdk/jdk/pull/12869/commits/9302d4bc00f8f1d8e774a260eb6aacb2d51a2dd4) > > Hi @vnkozlov , There is some discrepancy in results b/w interpreter, C1 and C2 for following case. > > public class Foo { > public static short bar(float f) {return Float.floatToFloat16(f);} > public static void main(String[] args) { > System.out.println(Float.floatToRawIntBits(Float.float16ToFloat((short) 31745))); > System.out.println(bar(Float.float16ToFloat((short) 31745))); > } > } > > CPROMPT>java -Xint -cp . Foo > 2143297536 // FP32 QNaN + significand preserved > 32257 // FP16 QNaN + significand preserved > CPROMPT>java -Xbatch -Xcomp -cp . Foo > 2139103232 // FP32 SNaN + significand preserved > 31745 // FP16 SNaN + significand preserved > CPROMPT>java -XX:-TieredCompilation -Xbatch -Xcomp -cp . Foo > 2139103232 // FP32 SNaN + significand preserved > 32257 // FP16 QNaN + significand preserved. @jatin-bhateja I applied the fix. Please, verify. 
------------- PR: https://git.openjdk.org/jdk/pull/12869 From psandoz at openjdk.org Wed Mar 8 00:32:17 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Wed, 8 Mar 2023 00:32:17 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice In-Reply-To: References: Message-ID: <_PA9oL9dVd3Yrg0sXw3m0uwfGjP6TuqXGBm5M090GHM=.a09a8733-e59e-4b6d-a6a6-e518a8518450@github.com> The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Tue, 7 Mar 2023 18:23:42 GMT, Quan Anh Mai wrote: > `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. > > A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. > > Please take a look and have some reviews. Thank you very much. src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 2289: > 2287: getClass(), byte.class, length(), > 2288: this, that, origin, > 2289: new VectorSliceOp() { Change from inner class to lambda expression? ------------- PR: https://git.openjdk.org/jdk/pull/12909 From psandoz at openjdk.org Wed Mar 8 00:49:16 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Wed, 8 Mar 2023 00:49:16 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice In-Reply-To: References: Message-ID: <_btCmeotboVIVWcIbHksJAaRcJO5aFl0CPVRnqpkuj0=.e3405352-fa81-4707-babe-25061abd99c5@github.com> On Tue, 7 Mar 2023 18:23:42 GMT, Quan Anh Mai wrote: > `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. > > A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. 
> > Please take a look and have some reviews. Thank you very much. test/hotspot/jtreg/compiler/vectorapi/TestVectorSlice.java line 65: > 63: Asserts.assertEquals(expected, dst[i][j]); > 64: } > 65: } It should be possible to factor out this code into something like this: assertOffsets(length, (expected, i, j) -> Assert.assertEquals((byte)expected, dst[i][j]) test/hotspot/jtreg/compiler/vectorapi/TestVectorSlice.java line 68: > 66: > 67: length = 16; > 68: testB128(dst, src1, src2); Should `dst` be zeroed before the next call? or maybe easier to just reallocate. test/jdk/jdk/incubator/vector/templates/Kernel-Slice-bop-const.template line 1: > 1: $type$[] a = fa.apply(SPECIES.length()); Forgot to commit the updated unit tests? ------------- PR: https://git.openjdk.org/jdk/pull/12909 From kvn at openjdk.org Wed Mar 8 00:56:14 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 00:56:14 GMT Subject: RFR: 8300821: UB: Applying non-zero offset to non-null pointer 0xfffffffffffffffe produced null pointer In-Reply-To: <3li-TpWHA4E_MnXnSXCI6AXju4o6rBSw0Ey2J6fXOKM=.f05f51f5-3add-4b11-bc34-efeae0e39b7c@github.com> References: <3li-TpWHA4E_MnXnSXCI6AXju4o6rBSw0Ey2J6fXOKM=.f05f51f5-3add-4b11-bc34-efeae0e39b7c@github.com> Message-ID: On Fri, 3 Mar 2023 14:46:51 GMT, Tobias Holenstein wrote: > "UndefinedBehaviorSanitizer" (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html) in Xcode running on `java --version` discovered an Undefined Behavior. The reason is in the `next()` method https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/asm/codeBuffer.cpp#L798 > > In ``RelocIterator::next()`` we get a nullpointer after `_current++` > https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/code/relocInfo.hpp#L612 > But this is actually expected: In the constructor of the iterator `RelocIterator::RelocIterator` we have > ```c++ > _current = cs->locs_start()-1; > _end = cs->locs_end(); > > and in our case locs_start() and locs_end() are `null` - so `_current` is `null`-1. After `_current++` both `_end` and `_current` are `null`. Just after `_current++` we then check if `_current == _end` and return `false` (there is no next reloc info) > > ## Solution > We want to be able to turn on "UndefinedBehaviorSanitizer" and don't have false positives. So we add a check > `cs->has_locs()` and only create the iterator if we have reloc info. > > Also added a sanity check in `RelocIterator::RelocIterator` that checks that either both `_current` and `_end` are null or both are not null. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12854 From fyang at openjdk.org Wed Mar 8 01:57:31 2023 From: fyang at openjdk.org (Fei Yang) Date: Wed, 8 Mar 2023 01:57:31 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v3] In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 21:38:38 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. 
Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Do not allow JIT compilation of Float.float16ToFloat and Float.floatToFloat16 Hi, Thanks for handling linux-riscv64 at the same time. Bad news is that we witnessed test failures when running following test with QEMU (no riscv hardware available with Zfhmin extension for now): test/hotspot/jtreg/compiler/intrinsics/float16/TestAllFloat16ToFloat.java Exception in thread "main" java.lang.RuntimeException: Inconsistent result for Float.floatToFloat16(NaN/7fc00000): 7e00 != fc01 at TestAllFloat16ToFloat.verify(TestAllFloat16ToFloat.java:62) at TestAllFloat16ToFloat.run(TestAllFloat16ToFloat.java:72) at TestAllFloat16ToFloat.main(TestAllFloat16ToFloat.java:94) It looks like there is a problem when handling NaNs with fcvt.h.s/fmv.x.h and fmv.h.x/fcvt.s.h instructions at the bottom. It's also possible to be an issue of QEMU as well. It would take quite a while to diagnose. But I don't want this to block this PR. So I would prefer removing support of this feature for this port and adding back once this is resolved. And I have prepared a patch for that purpose. See attachement. [12869-revert-riscv.txt](https://github.com/openjdk/jdk/files/10915606/12869-revert-riscv.txt) ------------- PR: https://git.openjdk.org/jdk/pull/12869 From wanghaomin at openjdk.org Wed Mar 8 03:59:51 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 8 Mar 2023 03:59:51 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest Message-ID: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> After https://bugs.openjdk.org/browse/JDK-8292289 , the base class of VectorTestNode changed from Node to CmpNode. So I add two match rule into ad file. match(If cop (VectorTest op1 op2)); match(Set dst (CMoveI (Binary cop (VectorTest op1 op2)) (Binary src1 src2))); First error, rule1 shouldn't generate the statement "node->_bottom_type = _leaf->bottom_type();". Second error, both rule1 and rule2 need to use VectorTestNode, the VectorTestNode should be cloned like CmpNode. ------------- Commit messages: - 8303804: Fix some errors of If-VectorTest and CMove-VectorTest Changes: https://git.openjdk.org/jdk/pull/12917/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12917&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303804 Stats: 4 lines in 2 files changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12917.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12917/head:pull/12917 PR: https://git.openjdk.org/jdk/pull/12917 From letianqiu at gmail.com Wed Mar 8 01:14:52 2023 From: letianqiu at gmail.com (tianle qiu) Date: Wed, 8 Mar 2023 12:14:52 +1100 Subject: A Noob Question of C1 LIR branch Message-ID: Hi, In my Phd project, I am messing around with barriers and notice a weird behavior of branching. It might be that I am doing it wrong. 
Typically, a barrier will do a check and then jump to slow case if necessary. The code often looks like: __ branch(lir_cond_notEqual, T_INT, slow); __ branch_destination(slow->continuation()); (These two lines are copied from G1BarrierSetC1::pre_barrier). But if I changed those two lines to: __ branch(lir_cond_equal, T_INT, slow->continuation()); __ jump(slow); __ branch_destination(slow->continuation()); Then hotspot cannot bootstrap, causing build failure. I know in practice no one is going to write code like that, but it does look weird. They should be logically equivalent. By the way, I am using the latest jdk11u master branch. Any suggestions would be appreciated. Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From kvn at openjdk.org Wed Mar 8 05:17:53 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 05:17:53 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: References: Message-ID: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Remove RISC-V port code for float16 intrinsics ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12869/files - new: https://git.openjdk.org/jdk/pull/12869/files/ed01863d..9f4b2474 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12869&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12869&range=02-03 Stats: 130 lines in 13 files changed: 0 ins; 116 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/12869.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12869/head:pull/12869 PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Wed Mar 8 05:17:56 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 05:17:56 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v3] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 01:53:14 GMT, Fei Yang wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Do not allow JIT compilation of Float.float16ToFloat and Float.floatToFloat16 > > Hi, Thanks for handling linux-riscv64 at the same time. 
> Bad news is that we witnessed test failures when running following test with QEMU (no riscv hardware available with Zfhmin extension for now): test/hotspot/jtreg/compiler/intrinsics/float16/TestAllFloat16ToFloat.java > > > Exception in thread "main" java.lang.RuntimeException: Inconsistent result for Float.floatToFloat16(NaN/7fc00000): 7e00 != fc01 > at TestAllFloat16ToFloat.verify(TestAllFloat16ToFloat.java:62) > at TestAllFloat16ToFloat.run(TestAllFloat16ToFloat.java:72) > at TestAllFloat16ToFloat.main(TestAllFloat16ToFloat.java:94) > > > It looks like there is a problem when handling NaNs with fcvt.h.s/fmv.x.h and fmv.h.x/fcvt.s.h instructions at the bottom. > It's also possible to be an issue of QEMU as well. It would take quite a while to diagnose. But I don't want this to block this PR. > So I would prefer removing support of this feature for this port and adding back once this is resolved. And I have prepared a patch for that purpose. See attachement. > [12869-revert-riscv.txt](https://github.com/openjdk/jdk/files/10915606/12869-revert-riscv.txt) Thank you very much @RealFYang for testing changes and preparing patch. I applied your patch. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From xliu at openjdk.org Wed Mar 8 06:19:28 2023 From: xliu at openjdk.org (Xin Liu) Date: Wed, 8 Mar 2023 06:19:28 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v4] In-Reply-To: <0gI6DIHtc7F63CFYoccotGQv-BHYadPRW0liqEQvh6Q=.58a774ec-7591-4f44-aa7f-7755593ac04e@github.com> References: <0gI6DIHtc7F63CFYoccotGQv-BHYadPRW0liqEQvh6Q=.58a774ec-7591-4f44-aa7f-7755593ac04e@github.com> Message-ID: On Fri, 3 Mar 2023 06:31:53 GMT, Tobias Hartmann wrote: >> src/hotspot/share/opto/phaseX.cpp line 474: >> >>> 472: GrowableArray* old_node_note_array = C->node_note_array(); >>> 473: if (old_node_note_array != nullptr) { >>> 474: int new_size = (_useful.size() >> 8) + 1; // The node note array uses blocks, see C->_log2_node_notes_block_size >> >> You should call `new_size = MAX2(8, new_size)` to make sure that we have at least 8 elements for initial allocation. > > Okay, I added that. The 8 seems arbitrary to me but since we already use that for initial allocation of the array, we can as well be consistent here. Just note that since we are calling `C->grow_node_notes`, we will also initialize with `Node_Notes*` right away. Why don't we just use `C->_log2_node_notes_block_size` directly in (_useful.size() >> 8)? I don't understand why we have to add MAX2(8, new_size) either. It looks like c2 doesn't want to have node-level accuracy. It drops the lowest 8bits of node_idx as block_id. I think the minimal number of "block" is 1, or arr is NULL. ------------- PR: https://git.openjdk.org/jdk/pull/12806 From duke at openjdk.org Wed Mar 8 08:19:05 2023 From: duke at openjdk.org (Daniel Skantz) Date: Wed, 8 Mar 2023 08:19:05 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v4] In-Reply-To: References: Message-ID: > We are lifting some loopopts/superword tests to use the IR framework, and add IR annotations to check that vector reductions take place on x86_64. This can be useful to prevent issues such as JDK-8300865. > > Approach: lift the more general tests in loopopts/superword, mainly using matching rules in cpu/x86/x86.ad, but leave tests mostly unchanged otherwise. Some reductions are considered non-profitable (superword.cpp), so we might need to raise sse/avx value pre-conditions from what would be a strict reading of x86.ad (as noted by @eme64). 
> > Testing: Local testing (x86_64) using UseSSE={2,3,4}, UseAVX={0,1,2,3}. Tested running all jtreg compiler tests. Tier1-tier5 runs to my knowledge never showed any compiler-related regression in other tests as a result from this work. GHA. Validation: all tests fail if we put unreasonable counts for the respective reduction node, such as counts = {IRNode.ADD_REDUCTION_VI, ">= 10000000"}). > > Thanks @robcasloz and @eme64 for advice. > > Notes: ProdRed_Double does not vectorize (JDK-8300865). SumRed_Long does not vectorize on 32-bit, according to my reading of source, test on GHA and cross-compiled JDK on 32-bit Linux, so removed these platforms from @requires. Lifted the AbsNeg tests too but added no checks, as these are currently not run on x86_64. Daniel Skantz has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 14 additional commits since the last revision: - Merge branch 'master' of github.com:openjdk/jdk into JDK-8294715-IR-new - Correctly reset totals in RedTest*; put debug print msgs in exception - Remove 2-unroll scenario - remove duped unrolllimit - fix typo; remove non-store case from SumRedSqrt_Double due to slow run time - Merge branch 'master' of github.com:openjdk/jdk into JDK-8294715-IR-new - Remove print statements for prints that were silenced by IR framework addition - Remove non-double stores - Revert much of last commit, and part of the first commit addressing review comments : intention is to remove all the negative tests, except for on -XX:-SuperWordReductions. Keep some comments and additional IR nodes added to existing checks. - Address further review comments (edits) - ... and 4 more: https://git.openjdk.org/jdk/compare/0f7d12c4...ec02160d ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12683/files - new: https://git.openjdk.org/jdk/pull/12683/files/21598f4e..ec02160d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12683&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12683&range=02-03 Stats: 29253 lines in 1051 files changed: 19049 ins; 5720 del; 4484 mod Patch: https://git.openjdk.org/jdk/pull/12683.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12683/head:pull/12683 PR: https://git.openjdk.org/jdk/pull/12683 From jbhateja at openjdk.org Wed Mar 8 08:39:27 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 8 Mar 2023 08:39:27 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: References: Message-ID: <7DG9mLetq1UlaCe0EbNx-Lvk6roh05PalMDwmGsPQwU=.a6a01b6a-8575-49d5-a8d9-d31eee93ad25@github.com> On Wed, 8 Mar 2023 05:17:53 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. 
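(As an aside, the property that the interpreter, C1, and C2 must agree on can be shown with a small round-trip sketch. The class name and the NaN bit pattern below are arbitrary; `Float.floatToFloat16` and `Float.float16ToFloat` are the real API methods being intrinsified.)

```java
public class Float16NaNSketch {
    public static void main(String[] args) {
        // A NaN float must still be a NaN after the float -> float16 -> float round
        // trip, no matter whether the conversion ran in the interpreter, C1 or C2.
        float f = Float.intBitsToFloat(0x7fc00001);   // a NaN with a non-zero payload
        short h = Float.floatToFloat16(f);
        float back = Float.float16ToFloat(h);
        if (!Float.isNaN(back)) {
            throw new RuntimeException("NaN lost: 0x" + Integer.toHexString(h & 0xffff));
        }
    }
}
```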
>> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove RISC-V port code for float16 intrinsics Hi @vnkozlov , Thanks for explanations, looks good to me now. ------------- Marked as reviewed by jbhateja (Reviewer). PR: https://git.openjdk.org/jdk/pull/12869 From wanghaomin at openjdk.org Wed Mar 8 09:03:41 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 8 Mar 2023 09:03:41 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest In-Reply-To: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> Message-ID: On Wed, 8 Mar 2023 03:52:33 GMT, Wang Haomin wrote: > After https://bugs.openjdk.org/browse/JDK-8292289 , the base class of VectorTestNode changed from Node to CmpNode. So I add two match rule into ad file. > > match(If cop (VectorTest op1 op2)); > match(Set dst (CMoveI (Binary cop (VectorTest op1 op2)) (Binary src1 src2))); > > First error, rule1 shouldn't generate the statement "node->_bottom_type = _leaf->bottom_type();". > Second error, both rule1 and rule2 need to use VectorTestNode, the VectorTestNode should be cloned like CmpNode. @merykitty Could you review this, Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/12917 From thartmann at openjdk.org Wed Mar 8 09:27:38 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 8 Mar 2023 09:27:38 GMT Subject: RFR: 8300821: UB: Applying non-zero offset to non-null pointer 0xfffffffffffffffe produced null pointer In-Reply-To: <3li-TpWHA4E_MnXnSXCI6AXju4o6rBSw0Ey2J6fXOKM=.f05f51f5-3add-4b11-bc34-efeae0e39b7c@github.com> References: <3li-TpWHA4E_MnXnSXCI6AXju4o6rBSw0Ey2J6fXOKM=.f05f51f5-3add-4b11-bc34-efeae0e39b7c@github.com> Message-ID: On Fri, 3 Mar 2023 14:46:51 GMT, Tobias Holenstein wrote: > "UndefinedBehaviorSanitizer" (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html) in Xcode running on `java --version` discovered an Undefined Behavior. The reason is in the `next()` method https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/asm/codeBuffer.cpp#L798 > > In ``RelocIterator::next()`` we get a nullpointer after `_current++` > https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/code/relocInfo.hpp#L612 > But this is actually expected: In the constructor of the iterator `RelocIterator::RelocIterator` we have > ```c++ > _current = cs->locs_start()-1; > _end = cs->locs_end(); > > and in our case locs_start() and locs_end() are `null` - so `_current` is `null`-1. After `_current++` both `_end` and `_current` are `null`. Just after `_current++` we then check if `_current == _end` and return `false` (there is no next reloc info) > > ## Solution > We want to be able to turn on "UndefinedBehaviorSanitizer" and don't have false positives. So we add a check > `cs->has_locs()` and only create the iterator if we have reloc info. > > Also added a sanity check in `RelocIterator::RelocIterator` that checks that either both `_current` and `_end` are null or both are not null. 
Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12854 From thartmann at openjdk.org Wed Mar 8 10:00:07 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 8 Mar 2023 10:00:07 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 16:16:08 GMT, Vladimir Kozlov wrote: > Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. > We have *product* VM flags for most intrinsics and set them in VM based on HW support. > But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. > Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. > > I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). > > Additional fixes: > Fixed Interpreter to skip intrinsics if they are disabled with flag. > Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. > Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. > Added missing `native` mark to `_currentThread`. > Removed unused `AbstractInterpreter::in_native_entry()`. > Cleanup C2 intrinsic checks code. > > Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. Looks good to me. src/hotspot/share/opto/c2compiler.hpp line 68: > 66: // Check if the compiler supports an intrinsic for 'method' given the > 67: // the dispatch mode specified by the 'is_virtual' parameter. > 68: bool is_virtual_intrinsic_supported(vmIntrinsics::ID id, bool is_virtual); I find the new name of the method confusing because it suggests that the intrinsic is always virtual. Can we keep the old name? ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12858 From thartmann at openjdk.org Wed Mar 8 10:04:13 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 8 Mar 2023 10:04:13 GMT Subject: RFR: 8302508: Add timestamp to the output TraceCompilerThreads In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 03:49:52 GMT, Vladimir Kozlov wrote: > Having timestamps added to the output of TraceCompilerThreads will be helpful in understanding how frequently the compiler threads are being added or removed. > > I did that and also added UL output. > > > java -XX:+TraceCompilerThreads -XX:+PrintCompilation -version > > 86 Added initial compiler thread C2 CompilerThread0 > 86 Added initial compiler thread C1 CompilerThread0 > 92 1 3 java.lang.Object:: (1 bytes) > 96 2 3 java.lang.String::coder (15 bytes) > > java -Xlog:jit+thread=debug -Xlog:jit+compilation=debug -version > > [0.078s][debug][jit,thread] Added initial compiler thread C2 CompilerThread0 > [0.078s][debug][jit,thread] Added initial compiler thread C1 CompilerThread0 > [0.083s][debug][jit,compilation] 1 3 java.lang.Object:: (1 bytes) > [0.087s][debug][jit,compilation] 2 3 java.lang.String::coder (15 bytes) > > > Tested tier1. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12898 From epeter at openjdk.org Wed Mar 8 10:15:45 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 8 Mar 2023 10:15:45 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v21] In-Reply-To: References: Message-ID: > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. 
Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: TestDependencyOffsets.java: parallelize it + various AVX settings ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/fb7f6dd9..9e3d4805 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=20 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=19-20 Stats: 760 lines in 1 file changed: 476 ins; 106 del; 178 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Wed Mar 8 10:15:47 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 8 Mar 2023 10:15:47 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v20] In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 07:37:00 GMT, Emanuel Peter wrote: >> Cyclic dependencies are not handled correctly in all cases. Three examples: >> >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 >> >> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 >> >> And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 >> >> All of these examples are vectorized, despite the cyclic dependency of distance 2. 
The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. >> >> **Analysis** >> >> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: >> >> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. >> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). >> >> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. >> >> I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. >> **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! >> >> **Solution** >> >> First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. >> >> I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. >> >> Another change I have made: >> Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. >> >> **Testing** >> >> I added a few more regression tests, and am running tier1-3, plus some stress testing. >> >> However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. 
>> **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. >> >> **Discussion / Future Work** >> >> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > TestOptionVectorizeIR.java: removed PopulateIndex IR rule - fails on x86 32bit - see Matcher::match_rule_supported @jatin-bhateja do your agree with the recent changes to `TestDependencyOffsets.java`? Do you think the jtreg-tests are ok, with their requirements? ------------- PR: https://git.openjdk.org/jdk/pull/12350 From thartmann at openjdk.org Wed Mar 8 10:25:12 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 8 Mar 2023 10:25:12 GMT Subject: RFR: JDK-8303443: IGV: Syntax highlighting and resizing for filter editor [v3] In-Reply-To: References: Message-ID: <4qRM88afZnlsZ0JdxHMiT5O5wzyeSDbYByWBBWLE3io=.48a1b9c0-f050-4e1c-a9f7-a632824eff05@github.com> On Mon, 6 Mar 2023 14:59:00 GMT, Tobias Holenstein wrote: >> In the Filter window of the IdealGraphVisualizer (IGV) the user can double-click on a filter to edit the javascript code. >> >> - Previously, the code window was not resizable and had no syntax highlighting >> editor_old >> >> - Now, the code window can be resized by the user and has basic syntax highlighting for `keywords`, `strings` and `comments` >> editor_new > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > make NetBeans form editor work again Works well for me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12803 From epeter at openjdk.org Wed Mar 8 10:26:23 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 8 Mar 2023 10:26:23 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v9] In-Reply-To: References: <8lyMxPlTmzLDuwFLvhla8AAL6sVxEjigf4LaDlsAVWg=.4bfb5428-4b11-4c65-8540-503cf9e595f1@github.com> <_eDa4Xjs3HIxm2u4bUOWmmZL6y2DjQGd9kVyrUCTSD0=.d56d4533-bf46-4900-8c7c-6e22d247ff22@github.com> <9bFwAtzXSc26dTNribicwtgQTa5GENR0Pwl0xUA1v2A=.2aa36d32-5976-4f75-a0d1-44227fc29168@github.com> Message-ID: On Tue, 7 Mar 2023 11:03:18 GMT, Jatin Bhateja wrote: >> Hi @jatin-bhateja . >> At Oracle we run the `compiler/loopopts/superword` and `compiler/vectorization` tests with various `AVX` and `SSE` settings, including the `UseKNLSetting`. >> >> Actually, I'm against splitting the test for different platforms. Because we have a few tests that now only get executed on one platform, and the features might be rotting on other platforms without us noticing it. Also: I don't like code duplication. If someone wants to add a test, then it has to be added in multiple files. Not great. >> >> My suggestion: Instead of the Scenarios, I can create multiple jtreg-test statements. Advantages: >> >> 1. The can run in parallel. >> 2. I can require platform features. >> 3. I can have different jvm-flags for different runs. >> >> That way I can make some jtreg-test statements for the `AVX / SSE` platforms (and have different `UseAVX` settings). And other jtreg-test statements for other platforms (eg. aarch64 `asimd == Neon`). 
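(For illustration, such per-platform jtreg run statements could look roughly like the sketch below. The test name matches the one discussed here, but the exact `@requires` expressions and flag values are only an example of the idea, not the actual test description.)

```java
/*
 * @test id=x64-avx2
 * @requires os.arch == "amd64" | os.arch == "x86_64"
 * @run main/othervm -XX:UseAVX=2 compiler.loopopts.superword.TestDependencyOffsets
 */

/*
 * @test id=aarch64-asimd
 * @requires os.arch == "aarch64"
 * @run main/othervm compiler.loopopts.superword.TestDependencyOffsets
 */
```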
But I will keep the IR rules for all platforms at the specific `@Test`. > > Agree, it will be better if we can run multiple IR rules on one target using UseSSE/UseAVX flags rather than having tight target dependencies. @jatin-bhateja do your agree with the recent changes to TestDependencyOffsets.java? Do you think the jtreg-tests are ok, with their requirements? ------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Wed Mar 8 10:33:03 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 8 Mar 2023 10:33:03 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v22] In-Reply-To: References: Message-ID: > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. 
The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: TestDependencyOffsets.java: add vanilla run ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/9e3d4805..a44082b6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=20-21 Stats: 24 lines in 1 file changed: 24 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From adinn at openjdk.org Wed Mar 8 10:52:09 2023 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 8 Mar 2023 10:52:09 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: References: Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Fri, 20 Jan 2023 15:33:50 GMT, Doug Simon wrote: >> This PR fixes the handling of a full CodeBuffer when emitting stubs as part of JVMCI code installation. > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8299570 > - changed return type of NativeCall::trampoline_jump to void > - try rationalize code in NativeCall::trampoline_jump > - properly handle CodeBuffer exhaustion in JVMCI backend I'm happy to approve this. ------------- Marked as reviewed by adinn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11945 From thartmann at openjdk.org Wed Mar 8 10:55:25 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 8 Mar 2023 10:55:25 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v4] In-Reply-To: References: <0gI6DIHtc7F63CFYoccotGQv-BHYadPRW0liqEQvh6Q=.58a774ec-7591-4f44-aa7f-7755593ac04e@github.com> Message-ID: On Wed, 8 Mar 2023 06:16:16 GMT, Xin Liu wrote: > Why don't we just use C->_log2_node_notes_block_size directly in (_useful.size() >> 8)? Because it's private in `Compile`. We could make it public but I thought it's not worth it. > I don't understand why we have to add MAX2(8, new_size) either. It looks like c2 doesn't want to have node-level accuracy. It drops the lowest 8bits of node_idx as block_id. I think the minimal number of "block" is 1, or arr is NULL. I think you are misinterpreting the code in `Compile::locate_node_notes`. It first determines the `block_idx` by `idx >> _log2_node_notes_block_size` and then the position in that block by `idx & (_node_notes_block_size-1)`. ------------- PR: https://git.openjdk.org/jdk/pull/12806 From tholenstein at openjdk.org Wed Mar 8 10:55:27 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 8 Mar 2023 10:55:27 GMT Subject: RFR: JDK-8303443: IGV: Syntax highlighting and resizing for filter editor [v3] In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 15:35:12 GMT, Roberto Casta?eda Lozano wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> make NetBeans form editor work again > > Thanks for addressing my comments, looks good! thanks @robcasloz and @TobiHartmann for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/12803 From tholenstein at openjdk.org Wed Mar 8 10:55:28 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 8 Mar 2023 10:55:28 GMT Subject: Integrated: JDK-8303443: IGV: Syntax highlighting and resizing for filter editor In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 12:14:07 GMT, Tobias Holenstein wrote: > In the Filter window of the IdealGraphVisualizer (IGV) the user can double-click on a filter to edit the javascript code. > > - Previously, the code window was not resizable and had no syntax highlighting > editor_old > > - Now, the code window can be resized by the user and has basic syntax highlighting for `keywords`, `strings` and `comments` > editor_new This pull request has now been integrated. 
Changeset: d9882523 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/d9882523780f360afc94d3df5658019d832e596e Stats: 113 lines in 2 files changed: 89 ins; 3 del; 21 mod 8303443: IGV: Syntax highlighting and resizing for filter editor Reviewed-by: rcastanedalo, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12803 From adinn at openjdk.org Wed Mar 8 10:57:19 2023 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 8 Mar 2023 10:57:19 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: References: Message-ID: On Fri, 20 Jan 2023 15:33:50 GMT, Doug Simon wrote: >> This PR fixes the handling of a full CodeBuffer when emitting stubs as part of JVMCI code installation. > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8299570 > - changed return type of NativeCall::trampoline_jump to void > - try rationalize code in NativeCall::trampoline_jump > - properly handle CodeBuffer exhaustion in JVMCI backend @dougxc I'm assuming the test failures are unrelated? If so then ok to push. ------------- PR: https://git.openjdk.org/jdk/pull/11945 From aph at openjdk.org Wed Mar 8 10:57:21 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 8 Mar 2023 10:57:21 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: References: Message-ID: On Fri, 20 Jan 2023 15:33:50 GMT, Doug Simon wrote: >> This PR fixes the handling of a full CodeBuffer when emitting stubs as part of JVMCI code installation. > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8299570 > - changed return type of NativeCall::trampoline_jump to void > - try rationalize code in NativeCall::trampoline_jump > - properly handle CodeBuffer exhaustion in JVMCI backend src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp line 548: > 546: } > 547: } else { > 548: // If not using far branches, patch this call directly to dest. This is all very complicated. Can't we just add `JVMCI_ERROR`s where we need them? ------------- PR: https://git.openjdk.org/jdk/pull/11945 From dnsimon at openjdk.org Wed Mar 8 11:04:26 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 8 Mar 2023 11:04:26 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: References: Message-ID: <-uUN-pL3z7PbMG_UqGO454NNQtTiD_jhWEALBbCGk5Q=.6bf265d0-66b1-4369-ae65-5ad0bfbd8f79@github.com> On Wed, 8 Mar 2023 10:54:24 GMT, Andrew Haley wrote: >> Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8299570 >> - changed return type of NativeCall::trampoline_jump to void >> - try rationalize code in NativeCall::trampoline_jump >> - properly handle CodeBuffer exhaustion in JVMCI backend > > src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp line 548: > >> 546: } >> 547: } else { >> 548: // If not using far branches, patch this call directly to dest. > > This is all very complicated. Can't we just add `JVMCI_ERROR`s where we need them? I'm not following - isn't this exactly what the code is doing? Maybe you could demonstrate how you think it should look. 
------------- PR: https://git.openjdk.org/jdk/pull/11945 From aph at openjdk.org Wed Mar 8 11:04:27 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 8 Mar 2023 11:04:27 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: <-uUN-pL3z7PbMG_UqGO454NNQtTiD_jhWEALBbCGk5Q=.6bf265d0-66b1-4369-ae65-5ad0bfbd8f79@github.com> References: <-uUN-pL3z7PbMG_UqGO454NNQtTiD_jhWEALBbCGk5Q=.6bf265d0-66b1-4369-ae65-5ad0bfbd8f79@github.com> Message-ID: On Wed, 8 Mar 2023 10:59:35 GMT, Doug Simon wrote: >> src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp line 548: >> >>> 546: } >>> 547: } else { >>> 548: // If not using far branches, patch this call directly to dest. >> >> This is all very complicated. Can't we just add `JVMCI_ERROR`s where we need them? > > I'm not following - isn't this exactly what the code is doing? Maybe you could demonstrate how you think it should look. Maybe it would be nicer to get the `! far_branches` code path out of the way first, and return immediately. ------------- PR: https://git.openjdk.org/jdk/pull/11945 From dnsimon at openjdk.org Wed Mar 8 11:12:16 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 8 Mar 2023 11:12:16 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: References: <-uUN-pL3z7PbMG_UqGO454NNQtTiD_jhWEALBbCGk5Q=.6bf265d0-66b1-4369-ae65-5ad0bfbd8f79@github.com> Message-ID: <16XkRlEfbFrHsd3QFmSEQRs-IUeuLFM9kQ1LdSMmh78=.4e1799f6-063a-49e8-9623-e5c72d0fd788@github.com> On Wed, 8 Mar 2023 11:01:18 GMT, Andrew Haley wrote: >> I'm not following - isn't this exactly what the code is doing? Maybe you could demonstrate how you think it should look. > > Maybe it would be nicer to get the `! far_branches` code path out of the way first, and return immediately. Something like this? diff --git a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp index 88dc59f80d0..83ec182d2c7 100644 --- a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp +++ b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp @@ -532,21 +532,22 @@ void NativeCallTrampolineStub::set_destination(address new_destination) { void NativeCall::trampoline_jump(CodeBuffer &cbuf, address dest, JVMCI_TRAPS) { MacroAssembler a(&cbuf); - if (a.far_branches()) { - if (!is_NativeCallTrampolineStub_at(instruction_address() + displacement())) { - address stub = a.emit_trampoline_stub(instruction_address() - cbuf.insts()->start(), dest); - if (stub == nullptr) { - JVMCI_ERROR("could not emit trampoline stub - code cache is full"); - } - // The relocation is created while emitting the stub will ensure this - // call instruction is subsequently patched to call the stub. - } else { - // Not sure how this can be happen but be defensive - JVMCI_ERROR("single-use stub should not exist"); - } - } else { + if (!a.far_branches()) { // If not using far branches, patch this call directly to dest. set_destination(dest); + return; + } + + if (!is_NativeCallTrampolineStub_at(instruction_address() + displacement())) { + address stub = a.emit_trampoline_stub(instruction_address() - cbuf.insts()->start(), dest); + if (stub == nullptr) { + JVMCI_ERROR("could not emit trampoline stub - code cache is full"); + } + // The relocation is created while emitting the stub will ensure this + // call instruction is subsequently patched to call the stub. 
+ } else { + // Not sure how this can be happen but be defensive + JVMCI_ERROR("single-use stub should not exist"); } } #endif ------------- PR: https://git.openjdk.org/jdk/pull/11945 From tobias.hartmann at oracle.com Wed Mar 8 12:05:05 2023 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 8 Mar 2023 13:05:05 +0100 Subject: A Noob Question of C1 LIR branch In-Reply-To: References: Message-ID: Hi, I think the problem is that you are adding control flow within a basic block that is invisible to the register allocator. A similar issue is described here: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2018-September/030745.html Branching to a stub and jumping right back to the next instruction is okay but anything more complicated can lead to spilling/restoring in-between. In your case, the register allocator adds a mov 0x28(%rsp),%rsi between the two jumps. Best regards, Tobias On 08.03.23 02:14, tianle qiu wrote: > Hi, > In my Phd project, I am messing around with barriers and notice a weird behavior of branching. It > might be that I am doing it wrong. Typically, a barrier will do a check and then jump to slow case > if necessary. The code often looks like: > > __ branch(lir_cond_notEqual, T_INT, slow); > __ branch_destination(slow->continuation()); > > (These two lines are copied from G1BarrierSetC1::pre_barrier). > But if I changed those two lines to: > > __ branch(lir_cond_equal, T_INT, slow->continuation()); > __ jump(slow); > __ branch_destination(slow->continuation()); > > Then hotspot cannot bootstrap, causing build failure. I know in practice no one is going to write > code like that, but it does look weird. They should be logically equivalent. > By the way, I am using the latest jdk11u master branch. Any suggestions would be appreciated. Thanks. From rcastanedalo at openjdk.org Wed Mar 8 12:43:09 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 8 Mar 2023 12:43:09 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v4] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 08:19:05 GMT, Daniel Skantz wrote: >> We are lifting some loopopts/superword tests to use the IR framework, and add IR annotations to check that vector reductions take place on x86_64. This can be useful to prevent issues such as JDK-8300865. >> >> Approach: lift the more general tests in loopopts/superword, mainly using matching rules in cpu/x86/x86.ad, but leave tests mostly unchanged otherwise. Some reductions are considered non-profitable (superword.cpp), so we might need to raise sse/avx value pre-conditions from what would be a strict reading of x86.ad (as noted by @eme64). >> >> Testing: Local testing (x86_64) using UseSSE={2,3,4}, UseAVX={0,1,2,3}. Tested running all jtreg compiler tests. Tier1-tier5 runs to my knowledge never showed any compiler-related regression in other tests as a result from this work. GHA. Validation: all tests fail if we put unreasonable counts for the respective reduction node, such as counts = {IRNode.ADD_REDUCTION_VI, ">= 10000000"}). >> >> Thanks @robcasloz and @eme64 for advice. >> >> Notes: ProdRed_Double does not vectorize (JDK-8300865). SumRed_Long does not vectorize on 32-bit, according to my reading of source, test on GHA and cross-compiled JDK on 32-bit Linux, so removed these platforms from @requires. Lifted the AbsNeg tests too but added no checks, as these are currently not run on x86_64.
> > Daniel Skantz has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 14 additional commits since the last revision: > > - Merge branch 'master' of github.com:openjdk/jdk into JDK-8294715-IR-new > - Correctly reset totals in RedTest*; put debug print msgs in exception > - Remove 2-unroll scenario > - remove duped unrolllimit > - fix typo; remove non-store case from SumRedSqrt_Double due to slow run time > - Merge branch 'master' of github.com:openjdk/jdk into JDK-8294715-IR-new > - Remove print statements for prints that were silenced by IR framework addition > - Remove non-double stores > - Revert much of last commit, and part of the first commit addressing review comments : intention is to remove all the negative tests, except for on -XX:-SuperWordReductions. Keep some comments and additional IR nodes added to existing checks. > - Address further review comments (edits) > - ... and 4 more: https://git.openjdk.org/jdk/compare/c7819e4d...ec02160d Looks good. Thanks again for the thorough and meticulous work, Daniel! All `@IR` checks are either trivially negative (`@IR(applyIf = {"SuperWordReductions", "false"}, failOn = ...)`) or guarded by x86-specific features, so these changes should not cause false failures for non-x86 architectures anymore. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/12683 From aph at openjdk.org Wed Mar 8 13:15:15 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 8 Mar 2023 13:15:15 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: <16XkRlEfbFrHsd3QFmSEQRs-IUeuLFM9kQ1LdSMmh78=.4e1799f6-063a-49e8-9623-e5c72d0fd788@github.com> References: <-uUN-pL3z7PbMG_UqGO454NNQtTiD_jhWEALBbCGk5Q=.6bf265d0-66b1-4369-ae65-5ad0bfbd8f79@github.com> <16XkRlEfbFrHsd3QFmSEQRs-IUeuLFM9kQ1LdSMmh78=.4e1799f6-063a-49e8-9623-e5c72d0fd788@github.com> Message-ID: On Wed, 8 Mar 2023 11:09:44 GMT, Doug Simon wrote: >> Maybe it would be nicer to get the `! far_branches` code path out of the way first, and return immediately. > > Something like this? > > diff --git a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp > index 88dc59f80d0..83ec182d2c7 100644 > --- a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp > +++ b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp > @@ -532,21 +532,22 @@ void NativeCallTrampolineStub::set_destination(address new_destination) { > void NativeCall::trampoline_jump(CodeBuffer &cbuf, address dest, JVMCI_TRAPS) { > MacroAssembler a(&cbuf); > > - if (a.far_branches()) { > - if (!is_NativeCallTrampolineStub_at(instruction_address() + displacement())) { > - address stub = a.emit_trampoline_stub(instruction_address() - cbuf.insts()->start(), dest); > - if (stub == nullptr) { > - JVMCI_ERROR("could not emit trampoline stub - code cache is full"); > - } > - // The relocation is created while emitting the stub will ensure this > - // call instruction is subsequently patched to call the stub. > - } else { > - // Not sure how this can be happen but be defensive > - JVMCI_ERROR("single-use stub should not exist"); > - } > - } else { > + if (!a.far_branches()) { > // If not using far branches, patch this call directly to dest. 
> set_destination(dest); > + return; > + } > + > + if (!is_NativeCallTrampolineStub_at(instruction_address() + displacement())) { > + address stub = a.emit_trampoline_stub(instruction_address() - cbuf.insts()->start(), dest); > + if (stub == nullptr) { > + JVMCI_ERROR("could not emit trampoline stub - code cache is full"); > + } > + // The relocation is created while emitting the stub will ensure this > + // call instruction is subsequently patched to call the stub. > + } else { > + // Not sure how this can be happen but be defensive > + JVMCI_ERROR("single-use stub should not exist"); > } > } > #endif Yes. Or maybe // Generate a trampoline for a branch to dest. If there's no need for a // trampoline, simply patch the call directly to dest. void NativeCall::trampoline_jump(CodeBuffer &cbuf, address dest, JVMCI_TRAPS) { MacroAssembler a(&cbuf); if (! a.far_branches()) { // If not using far branches, patch this call directly to dest. set_destination(dest); } else if (!is_NativeCallTrampolineStub_at(instruction_address() + displacement())) { // If we want far branches and there isn't a trampoline stub, emit one. address stub = a.emit_trampoline_stub(instruction_address() - cbuf.insts()->start(), dest); if (stub == nullptr) { JVMCI_ERROR("could not emit trampoline stub - code cache is full"); } // The relocation is created while emitting the stub will ensure this // call instruction is subsequently patched to call the stub. } else { // Not sure how this can be happen but be defensive JVMCI_ERROR("single-use stub should not exist"); } } ------------- PR: https://git.openjdk.org/jdk/pull/11945 From qamai at openjdk.org Wed Mar 8 13:46:03 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 8 Mar 2023 13:46:03 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v2] In-Reply-To: References: Message-ID: > `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. > > A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. > > Please take a look and have some reviews. Thank you very much. 
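(For readers less familiar with the Vector API, here is a minimal usage sketch of the operation being intrinsified; the species choice and the origin value 3 are arbitrary, `slice` itself is the real method.)

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class SliceSketch {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    // Result lane i is v1.lane(3 + i) while 3 + i < 16, and v2.lane(3 + i - 16)
    // after that, i.e. a 16-byte window taken out of the concatenation of v1 and v2.
    static ByteVector sliceBy3(byte[] a, byte[] b) {
        ByteVector v1 = ByteVector.fromArray(SPECIES, a, 0);
        ByteVector v2 = ByteVector.fromArray(SPECIES, b, 0);
        return v1.slice(3, v2);
    }
}
```

On x86 with SSSE3 or later, this whole window extraction is what a single `palignr` can implement, which is the motivation for the intrinsic described above.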
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: address reviews ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12909/files - new: https://git.openjdk.org/jdk/pull/12909/files/e992d4c6..65409f13 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=00-01 Stats: 333 lines in 2 files changed: 61 ins; 182 del; 90 mod Patch: https://git.openjdk.org/jdk/pull/12909.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12909/head:pull/12909 PR: https://git.openjdk.org/jdk/pull/12909 From qamai at openjdk.org Wed Mar 8 13:52:18 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 8 Mar 2023 13:52:18 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v2] In-Reply-To: <_PA9oL9dVd3Yrg0sXw3m0uwfGjP6TuqXGBm5M090GHM=.a09a8733-e59e-4b6d-a6a6-e518a8518450@github.com> References: <_PA9oL9dVd3Yrg0sXw3m0uwfGjP6TuqXGBm5M090GHM=.a09a8733-e59e-4b6d-a6a6-e518a8518450@github.com> Message-ID: On Wed, 8 Mar 2023 00:29:05 GMT, Paul Sandoz wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> address reviews > > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 2289: > >> 2287: getClass(), byte.class, length(), >> 2288: this, that, origin, >> 2289: new VectorSliceOp() { > > Change from inner class to lambda expression? We still need this method to be inlined and I don't know if there is a way to annotate the lambda function. > test/hotspot/jtreg/compiler/vectorapi/TestVectorSlice.java line 65: > >> 63: Asserts.assertEquals(expected, dst[i][j]); >> 64: } >> 65: } > > It should be possible to factor out this code into something like this: > > assertOffsets(length, (expected, i, j) -> Assert.assertEquals((byte)expected, dst[i][j]) Fixed. > test/hotspot/jtreg/compiler/vectorapi/TestVectorSlice.java line 68: > >> 66: >> 67: length = 16; >> 68: testB128(dst, src1, src2); > > Should `dst` be zeroed before the next call? or maybe easier to just reallocate. Fixed, I just allocate another array. > test/jdk/jdk/incubator/vector/templates/Kernel-Slice-bop-const.template line 1: > >> 1: $type$[] a = fa.apply(SPECIES.length()); > > Forgot to commit the updated unit tests? This is for the microbenchmarks generated in the panama-vector repo only. Thanks a lot. ------------- PR: https://git.openjdk.org/jdk/pull/12909 From dnsimon at openjdk.org Wed Mar 8 14:00:57 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 8 Mar 2023 14:00:57 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v5] In-Reply-To: References: Message-ID: > This PR fixes the handling of a full CodeBuffer when emitting stubs as part of JVMCI code installation. 
Doug Simon has updated the pull request incrementally with one additional commit since the last revision: reorganized code in NativeCall::trampoline_jump ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11945/files - new: https://git.openjdk.org/jdk/pull/11945/files/9b1d1fbe..d723563a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11945&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11945&range=03-04 Stats: 24 lines in 1 file changed: 11 ins; 12 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11945.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11945/head:pull/11945 PR: https://git.openjdk.org/jdk/pull/11945 From dnsimon at openjdk.org Wed Mar 8 14:00:59 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 8 Mar 2023 14:00:59 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: References: <-uUN-pL3z7PbMG_UqGO454NNQtTiD_jhWEALBbCGk5Q=.6bf265d0-66b1-4369-ae65-5ad0bfbd8f79@github.com> <16XkRlEfbFrHsd3QFmSEQRs-IUeuLFM9kQ1LdSMmh78=.4e1799f6-063a-49e8-9623-e5c72d0fd788@github.com> Message-ID: On Wed, 8 Mar 2023 13:12:28 GMT, Andrew Haley wrote: >> Something like this? >> >> diff --git a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp >> index 88dc59f80d0..83ec182d2c7 100644 >> --- a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp >> +++ b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp >> @@ -532,21 +532,22 @@ void NativeCallTrampolineStub::set_destination(address new_destination) { >> void NativeCall::trampoline_jump(CodeBuffer &cbuf, address dest, JVMCI_TRAPS) { >> MacroAssembler a(&cbuf); >> >> - if (a.far_branches()) { >> - if (!is_NativeCallTrampolineStub_at(instruction_address() + displacement())) { >> - address stub = a.emit_trampoline_stub(instruction_address() - cbuf.insts()->start(), dest); >> - if (stub == nullptr) { >> - JVMCI_ERROR("could not emit trampoline stub - code cache is full"); >> - } >> - // The relocation is created while emitting the stub will ensure this >> - // call instruction is subsequently patched to call the stub. >> - } else { >> - // Not sure how this can be happen but be defensive >> - JVMCI_ERROR("single-use stub should not exist"); >> - } >> - } else { >> + if (!a.far_branches()) { >> // If not using far branches, patch this call directly to dest. >> set_destination(dest); >> + return; >> + } >> + >> + if (!is_NativeCallTrampolineStub_at(instruction_address() + displacement())) { >> + address stub = a.emit_trampoline_stub(instruction_address() - cbuf.insts()->start(), dest); >> + if (stub == nullptr) { >> + JVMCI_ERROR("could not emit trampoline stub - code cache is full"); >> + } >> + // The relocation is created while emitting the stub will ensure this >> + // call instruction is subsequently patched to call the stub. >> + } else { >> + // Not sure how this can be happen but be defensive >> + JVMCI_ERROR("single-use stub should not exist"); >> } >> } >> #endif > > Yes. Or maybe > > > // Generate a trampoline for a branch to dest. If there's no need for a > // trampoline, simply patch the call directly to dest. > void NativeCall::trampoline_jump(CodeBuffer &cbuf, address dest, JVMCI_TRAPS) { > MacroAssembler a(&cbuf); > > if (! a.far_branches()) { > // If not using far branches, patch this call directly to dest. > set_destination(dest); > } else if (!is_NativeCallTrampolineStub_at(instruction_address() + displacement())) { > // If we want far branches and there isn't a trampoline stub, emit one. 
> address stub = a.emit_trampoline_stub(instruction_address() - cbuf.insts()->start(), dest); > if (stub == nullptr) { > JVMCI_ERROR("could not emit trampoline stub - code cache is full"); > } > // The relocation is created while emitting the stub will ensure this > // call instruction is subsequently patched to call the stub. > } else { > // Not sure how this can be happen but be defensive > JVMCI_ERROR("single-use stub should not exist"); > } > } Done. ------------- PR: https://git.openjdk.org/jdk/pull/11945 From dnsimon at openjdk.org Wed Mar 8 14:14:43 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 8 Mar 2023 14:14:43 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v6] In-Reply-To: References: Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- > This PR fixes the handling of a full CodeBuffer when emitting stubs as part of JVMCI code installation. Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8299570 - reorganized code in NativeCall::trampoline_jump - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8299570 - changed return type of NativeCall::trampoline_jump to void - try rationalize code in NativeCall::trampoline_jump - properly handle CodeBuffer exhaustion in JVMCI backend ------------- Changes: https://git.openjdk.org/jdk/pull/11945/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11945&range=05 Stats: 45 lines in 5 files changed: 27 ins; 10 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/11945.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11945/head:pull/11945 PR: https://git.openjdk.org/jdk/pull/11945 From epeter at openjdk.org Wed Mar 8 14:32:16 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 8 Mar 2023 14:32:16 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v22] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 10:33:03 GMT, Emanuel Peter wrote: >> Cyclic dependencies are not handled correctly in all cases. Three examples: >> >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 >> >> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 >> >> And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 >> >> All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. >> >> **Analysis** >> >> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: >> >> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. 
When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. >> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). >> >> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. >> >> I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. >> **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! >> >> **Solution** >> >> First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. >> >> I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. >> >> Another change I have made: >> Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. >> >> **Testing** >> >> I added a few more regression tests, and am running tier1-3, plus some stress testing. >> >> However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. >> **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. >> >> **Discussion / Future Work** >> >> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? 
[JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > TestDependencyOffsets.java: add vanilla run

**Summary of this Post**: The original `_do_vector_loop` RFE assumed that the user would ensure that parallelization leads to correct results. I think that assumption is wrong. **Conclusion**: it is correct to verify `independence` on the pack level, as I have introduced in `combine_packs`.

I did some more digging, out of curiosity. Ignore it if you are not interested in the details.

Apparently, `IntStream.forEach` is **explicitly nondeterministic**, i.e. unordered, to improve parallelism. There is also an `IntStream.forEachOrdered` which must ensure the order. The parameter to `forEach` should be a `non-interfering action`. For parallel streams, the sequences can for example be split into chunks and executed on separate threads in parallel. It is up to the user to ensure `non-interference`. If the user does not ensure it, this is allowed to happen: https://github.com/openjdk/jdk/blob/fbe40e8926ff4376f945455c3d4e8ed20277eeba/src/java.base/share/classes/java/util/stream/package-info.java#L230-L232

I wonder what exactly `non-interference` with the "data source" means: 1. Does it mean we just cannot modify the data source of the stream, for example the `IntStream.range()`? So we cannot add, modify, or delete elements in the data source of the stream? 2. Does it mean that the different iterations are `non-interfering`? So we cannot store to an array position that another iteration would load, since we do not know in which order the iterations are executed?

I think it is 1. I think the user is allowed to write actions that interfere with other iterations - there is just no guarantee of the order of execution. Still, from what I understand: the idea of `forEachRemaining` is that it takes one of the chunks and works on it sequentially; after all, it can be called from `forEach` (unordered) or from `forEachOrdered`.

`_do_vector_loop` / CompileCommand `Vectorize` was introduced in [JDK-8076284](https://bugs.openjdk.org/browse/JDK-8076284) "Improve vectorization of parallel streams". I also read quickly through the [review](https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2015-May/017844.html) of it. It seems that not much of that code is still there; a lot of it seemed to be buggy. Still, one of its assumptions was: "Note, that neither 2 or 3 goes thru the data dependency analysis, since the correctness of parallelization was guaranteed by the user."

I have the hypothesis that `_do_vector_loop` was introduced under the assumption that the user only writes actions that do not interfere with other iterations, or that the user accepts that they may be executed concurrently, which could lead to race conditions if the user is not careful. However, not all streams are `parallel`. The flag can be flipped with `sequential()` and `parallel()`. Sequential streams should of course be executed sequentially. In that case, it is wrong not to do "dependency analysis" - after all, we need to respect the sequential dependencies.

I could imagine a future RFE that targets the **unordered** parallel streams (maybe with a `forEachRemainingParallel`?). I would still be skeptical that one can fully disable the "dependency" analysis. After all, the dependencies within a single iteration should still be respected, even if we could ignore the dependencies between iterations.
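To make that last distinction concrete, here is a small, hypothetical Java sketch (not taken from the PR or its tests; the array and loop bounds are invented) of a `forEach` action that interferes with other iterations of its source array:

    // Hypothetical illustration only: an inter-iteration dependence inside a forEach action.
    int[] a = new int[1024];
    java.util.stream.IntStream.range(1, a.length).forEach(i -> {
        // a[i - 1] is written by a different iteration. For an unordered parallel
        // stream the result is unspecified, so one could argue a vectorizer may
        // ignore this dependence; for a sequential stream it must be respected.
        // Either way, within a single iteration the load of a[i - 1] still has to
        // happen before the store to a[i].
        a[i] = a[i - 1] + 1;
    });

For an unordered parallel stream, one could argue that only the ordering within each iteration has to be preserved.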
Such a optimization could be interesting, as we may be able to vectorize this: void diffusion(float[] src, float[] dst) { for (int i = 1; i < src.length-1; i++) { dst[i] = a * src[i - 1] + b * src[i] + c * src[i + 1]; } } Since at most `dst[i]` and `src[i]` would overlap, and the dependencies of `src[i-1]` and `src[i+1]` on other iterations can be ignored. However, after unrolling, multiple iterations would have been put in sequential order. At that point, it may be challenging to separate out the iterations, and determine intra and inter iteration dependencies. **IntStream.forEach** https://github.com/openjdk/jdk/blob/fbe40e8926ff4376f945455c3d4e8ed20277eeba/src/java.base/share/classes/java/util/stream/Stream.java#L834-L851 **Non-Interference** https://github.com/openjdk/jdk/blob/fbe40e8926ff4376f945455c3d4e8ed20277eeba/src/java.base/share/classes/java/util/stream/package-info.java#L209-L253 ------------- PR: https://git.openjdk.org/jdk/pull/12350 From qamai at openjdk.org Wed Mar 8 14:44:00 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 8 Mar 2023 14:44:00 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest In-Reply-To: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> Message-ID: <9x4lN-lTSZwSFUhIsxtq2uDhy-9piQ6aoDHF94QLs7w=.c03c138e-e83f-483a-9e28-365b4052898a@github.com> On Wed, 8 Mar 2023 03:52:33 GMT, Wang Haomin wrote: > After https://bugs.openjdk.org/browse/JDK-8292289 , the base class of VectorTestNode changed from Node to CmpNode. So I add two match rule into ad file. > > match(If cop (VectorTest op1 op2)); > match(Set dst (CMoveI (Binary cop (VectorTest op1 op2)) (Binary src1 src2))); > > First error, rule1 shouldn't generate the statement "node->_bottom_type = _leaf->bottom_type();". > Second error, both rule1 and rule2 need to use VectorTestNode, the VectorTestNode should be cloned like CmpNode. Looks good to me with a small suggestion. Thanks a lot. src/hotspot/share/adlc/output_c.cpp line 3989: > 3987: if (inst->captures_bottom_type(_globalNames)) { > 3988: if (strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall")) > 3989: && strncmp("MachIf", inst->mach_base_class(_globalNames), strlen("MachIf"))) { Maybe explicitly compare the results with 0 here for consistency with other usage of `strcmp` ------------- Marked as reviewed by qamai (Committer). PR: https://git.openjdk.org/jdk/pull/12917 From jvernee at openjdk.org Wed Mar 8 14:45:55 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Wed, 8 Mar 2023 14:45:55 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle Message-ID: The issue is that the size of the code buffer is not large enough to hold the whole stub. Proposed solution is to scale the size of the stub with the number of arguments. I've adjusted sizes for both downcall and upcall stubs. I've also dropped the number of relocations, since we're not really using any for downcalls, and for upcalls we only have 1 AFAICS. (the size of the relocations can not be zero however, as that leads to the relocation section [not being initialized][1], and triggering [an assert][2] later when the code blob is copied). 
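For context, the stubs that overflowed the old fixed-size buffer are the ones generated when linking a downcall with many arguments. A rough, hypothetical Java sketch of such a link request follows (the class name and the argument count of 64 are invented for illustration; the actual regression test in the PR may look different):

    // Hypothetical sketch: linking a downcall handle whose stub must marshal many arguments.
    // Uses the java.lang.foreign preview API (run with --enable-preview on recent JDKs).
    import java.lang.foreign.*;
    import java.lang.invoke.MethodHandle;
    import java.util.Arrays;

    class ManyArgsDowncall {
        public static void main(String[] args) {
            MemoryLayout[] argLayouts = new MemoryLayout[64]; // argument count chosen arbitrarily
            Arrays.fill(argLayouts, ValueLayout.JAVA_DOUBLE);
            FunctionDescriptor fd = FunctionDescriptor.of(ValueLayout.JAVA_DOUBLE, argLayouts);
            // Linking emits the downcall stub (nep_invoker_blob); its code size grows
            // with the number of arguments that have to be shuffled.
            MethodHandle handle = Linker.nativeLinker().downcallHandle(fd);
            System.out.println(handle.type());
        }
    }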
The way I've determined the new base size and per-argument size for stubs, is by first linking a stub without any arguments to get the required base size, and by then adding 20 `double` arguments to get a rough per-argument size. Both values have wiggle room as well. The sizes can be printed using e.g. `-XX:+LogCompilation`, and then looking for `nep_invoker_blob` and `upcall_stub*` in the log file. This experiment was done on a fastdebug build to account for additional debug code being generated. The included test is designed to try and maximize the size of the generated stub. I've also updated `CodeBuffer::log_section_sizes` to print the in-use size, rather than just the capacity and free space. [1]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L119-L121 [2]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L675 ------------- Commit messages: - improve test - fix test - adjust x86_64 stub size scaling - adjust upcall stub size based on arguments - adjust downcall stub size for aarch64 - fix comment typo - simplify test - improve log_section_sizes - base code size on argument count Changes: https://git.openjdk.org/jdk/pull/12908/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12908&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303022 Stats: 84 lines in 6 files changed: 73 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/12908.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12908/head:pull/12908 PR: https://git.openjdk.org/jdk/pull/12908 From poonam at openjdk.org Wed Mar 8 14:47:00 2023 From: poonam at openjdk.org (Poonam Bajaj) Date: Wed, 8 Mar 2023 14:47:00 GMT Subject: RFR: 8302508: Add timestamp to the output TraceCompilerThreads In-Reply-To: References: Message-ID: <9UQm-9DqJQA31NblCIeBMijftErsCqjIAKGHoRXf8GU=.d32fa3b9-7f18-4b62-87d0-36791af33e02@github.com> On Tue, 7 Mar 2023 03:49:52 GMT, Vladimir Kozlov wrote: > Having timestamps added to the output of TraceCompilerThreads will be helpful in understanding how frequently the compiler threads are being added or removed. > > I did that and also added UL output. > > > java -XX:+TraceCompilerThreads -XX:+PrintCompilation -version > > 86 Added initial compiler thread C2 CompilerThread0 > 86 Added initial compiler thread C1 CompilerThread0 > 92 1 3 java.lang.Object:: (1 bytes) > 96 2 3 java.lang.String::coder (15 bytes) > > java -Xlog:jit+thread=debug -Xlog:jit+compilation=debug -version > > [0.078s][debug][jit,thread] Added initial compiler thread C2 CompilerThread0 > [0.078s][debug][jit,thread] Added initial compiler thread C1 CompilerThread0 > [0.083s][debug][jit,compilation] 1 3 java.lang.Object:: (1 bytes) > [0.087s][debug][jit,compilation] 2 3 java.lang.String::coder (15 bytes) > > > Tested tier1. The changes look good to me! ------------- PR: https://git.openjdk.org/jdk/pull/12898 From jvernee at openjdk.org Wed Mar 8 14:52:01 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Wed, 8 Mar 2023 14:52:01 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v2] In-Reply-To: References: Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. 
---------------------------------------------------------------------- > The issue is that the size of the code buffer is not large enough to hold the whole stub. > > Proposed solution is to scale the size of the stub with the number of arguments. I've adjusted sizes for both downcall and upcall stubs. I've also dropped the number of relocations, since we're not really using any for downcalls, and for upcalls we only have 1 AFAICS. (the size of the relocations can not be zero however, as that leads to the relocation section [not being initialized][1], and triggering [an assert][2] later when the code blob is copied). > > The way I've determined the new base size and per-argument size for stubs, is by first linking a stub without any arguments to get the required base size, and by then adding 20 `double` arguments to get a rough per-argument size. Both values have wiggle room as well. The sizes can be printed using e.g. `-XX:+LogCompilation`, and then looking for `nep_invoker_blob` and `upcall_stub*` in the log file. This experiment was done on a fastdebug build to account for additional debug code being generated. The included test is designed to try and maximize the size of the generated stub. > > I've also updated `CodeBuffer::log_section_sizes` to print the in-use size, rather than just the capacity and free space. > > [1]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L119-L121 > [2]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L675 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: update copyright years ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12908/files - new: https://git.openjdk.org/jdk/pull/12908/files/6428c8b7..0a2bc96c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12908&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12908&range=00-01 Stats: 4 lines in 4 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/12908.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12908/head:pull/12908 PR: https://git.openjdk.org/jdk/pull/12908 From epeter at openjdk.org Wed Mar 8 15:24:41 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 8 Mar 2023 15:24:41 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v22] In-Reply-To: References: Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Wed, 8 Mar 2023 10:33:03 GMT, Emanuel Peter wrote: >> Cyclic dependencies are not handled correctly in all cases. 
Three examples: >> >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 >> >> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 >> >> And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 >> >> All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. >> >> **Analysis** >> >> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: >> >> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. >> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). >> >> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. >> >> I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. >> **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! >> >> **Solution** >> >> First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. >> >> I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. 
If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. >> >> Another change I have made: >> Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. >> >> **Testing** >> >> I added a few more regression tests, and am running tier1-3, plus some stress testing. >> >> However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. >> **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. >> >> **Discussion / Future Work** >> >> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > TestDependencyOffsets.java: add vanilla run Since this thread is so long, I summarize what I have done again: **Bug 1: bad nested if** In pseudo-code we did basically this in `find_adjacent_refs`, when checking that we have alignment with all other packs of the same `velt_type / memory_slice`: if (memory_alignment(mem_ref, best_iv_adjustment) == 0) { // go ahead and vectorize! // PROBLEM: what if "best" was from a different velt_type / memory slice? // We may have alignment with "best", but we ignore misalignment with other packs! } else { if (same_velt_type(mem_ref, best_align_to_mem_ref)) { // misaligned to same velt_type / memory_slice -> no vectorization } else { // for all other packs with the same velt_type / memory_slice, check if we have alignment -> only then vectorize } } Fix (swap if conditions, refactor): https://github.com/openjdk/jdk/blob/a44082b61f22dcdee115697f34d39c1d8382a15d/src/hotspot/share/opto/superword.cpp#L811-L815 **Bug 2: rejected memops should not be resurrected** If memops were rejected during `find_adjacent_refs`, we sometimes resurrected them again during `extend_packlist`. Memops may have been rejected because they had a misalignment (which could imply dependence). So we should not blindly resurrect such memops. Fix (only extend to non-memop): https://github.com/openjdk/jdk/blob/a44082b61f22dcdee115697f34d39c1d8382a15d/src/hotspot/share/opto/superword.cpp#L1562 However, there is a downside to this fix: there seemed to have been many "happy accidents", where resurrecting memops lead to vectorization that was correct. For example under `-XX:+AlignVector`, we require all `mem_refs` to align with `best`, and have a `vector_width` at most as large as that of `best`. This is to ensure that all `mem_refs` are aligned to memory, modulo their `vector_width` (once we memory align `best` to memory modulo its `vector_width` by adjusting the pre-loop limit). 
@Test @IR(counts = {IRNode.LOAD_VECTOR, ">0", IRNode.VECTOR_CAST_I2X, ">0", IRNode.STORE_VECTOR, ">0"}) private static void testConvI2D(double[] d, int[] a) { for(int i = 0; i < d.length; i++) { d[i] = (double) (a[i]); } } In `TestVectorizeTypeConversion.testConvI2D`, `best_align_to_mem_ref` is `StoreD`, which on some machines may have a smaller `vector_width` than the `LoadI`, and hence gets rejected. This would prevent vectorization. The "happy accident" was that it was resurrected during `extend_packlist`. This still leads to correct results "by accident", since most machines at most require 8-byte alignment, and not `vector_width` alignment. This performance regression can be fixed by a follow-up RFE. For now, we should prefer correctness over performance. **Bug 3: _do_vector_loop should not ignore dependencies** TODO write ------------- PR: https://git.openjdk.org/jdk/pull/12350 From aph at openjdk.org Wed Mar 8 15:45:04 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 8 Mar 2023 15:45:04 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v6] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 14:14:43 GMT, Doug Simon wrote: >> This PR fixes the handling of a full CodeBuffer when emitting stubs as part of JVMCI code installation. > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8299570 > - reorganized code in NativeCall::trampoline_jump > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8299570 > - changed return type of NativeCall::trampoline_jump to void > - try rationalize code in NativeCall::trampoline_jump > - properly handle CodeBuffer exhaustion in JVMCI backend Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11945 From psandoz at openjdk.org Wed Mar 8 16:22:13 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Wed, 8 Mar 2023 16:22:13 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v2] In-Reply-To: References: <_PA9oL9dVd3Yrg0sXw3m0uwfGjP6TuqXGBm5M090GHM=.a09a8733-e59e-4b6d-a6a6-e518a8518450@github.com> Message-ID: On Wed, 8 Mar 2023 13:48:16 GMT, Quan Anh Mai wrote: >> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 2289: >> >>> 2287: getClass(), byte.class, length(), >>> 2288: this, that, origin, >>> 2289: new VectorSliceOp() { >> >> Change from inner class to lambda expression? > > We still need this method to be inlined and I don't know if there is a way to annotate the lambda function. Yes, i wondered about the inline and how important it might be. You want the fallback to inline so as not to perturb platforms without the intrinsic. Can you add a comment on the anon class? ------------- PR: https://git.openjdk.org/jdk/pull/12909 From psandoz at openjdk.org Wed Mar 8 16:26:05 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Wed, 8 Mar 2023 16:26:05 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v2] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 13:46:03 GMT, Quan Anh Mai wrote: >> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. 
Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. >> >> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > address reviews Java changes look good. The HotSpot code looks well structured but i will let others comment on the specifics. ------------- Marked as reviewed by psandoz (Reviewer). PR: https://git.openjdk.org/jdk/pull/12909 From qamai at openjdk.org Wed Mar 8 17:24:50 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 8 Mar 2023 17:24:50 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v3] In-Reply-To: References: Message-ID: > `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. > > A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. > > Please take a look and have some reviews. Thank you very much. 
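As a concrete illustration of the windowing semantics described above (the class name, species and lane values are chosen arbitrarily and are not taken from the PR or its tests):

    // Illustrative sketch of Vector.slice semantics using the incubating Vector API
    // (compile and run with --add-modules jdk.incubator.vector).
    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorSpecies;

    class SliceDemo {
        static final VectorSpecies<Integer> S = IntVector.SPECIES_128; // 4 int lanes
        public static void main(String[] args) {
            IntVector v1 = IntVector.fromArray(S, new int[] {0, 1, 2, 3}, 0);
            IntVector v2 = IntVector.fromArray(S, new int[] {4, 5, 6, 7}, 0);
            // Logically concatenate [v1 | v2] = {0,1,2,3,4,5,6,7} and extract the
            // 4-lane window starting at lane 2 of the composite.
            IntVector r = v1.slice(2, v2);
            System.out.println(r); // [2, 3, 4, 5]
        }
    }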
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: add comments explaining anonymous classes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12909/files - new: https://git.openjdk.org/jdk/pull/12909/files/65409f13..c31fdfe8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=01-02 Stats: 21 lines in 7 files changed: 21 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12909.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12909/head:pull/12909 PR: https://git.openjdk.org/jdk/pull/12909 From qamai at openjdk.org Wed Mar 8 17:24:55 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 8 Mar 2023 17:24:55 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v2] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 16:23:16 GMT, Paul Sandoz wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> address reviews > > Java changes look good. The HotSpot code looks well structured but i will let others comment on the specifics. @PaulSandoz Thanks for your review, I have added a comment explaining the rationales behind the anonymous class usage. ------------- PR: https://git.openjdk.org/jdk/pull/12909 From kvn at openjdk.org Wed Mar 8 18:20:27 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 18:20:27 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 09:55:47 GMT, Tobias Hartmann wrote: >> Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. >> We have *product* VM flags for most intrinsics and set them in VM based on HW support. >> But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. >> Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. >> >> I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). >> >> Additional fixes: >> Fixed Interpreter to skip intrinsics if they are disabled with flag. >> Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. >> Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. >> Added missing `native` mark to `_currentThread`. >> Removed unused `AbstractInterpreter::in_native_entry()`. >> Cleanup C2 intrinsic checks code. >> >> Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. > > src/hotspot/share/opto/c2compiler.hpp line 68: > >> 66: // Check if the compiler supports an intrinsic for 'method' given the >> 67: // the dispatch mode specified by the 'is_virtual' parameter. >> 68: bool is_virtual_intrinsic_supported(vmIntrinsics::ID id, bool is_virtual); > > I find the new name of the method confusing because it suggests that the intrinsic is always virtual. Can we keep the old name? We still have `is_intrinsic_supported()` declared at line 64. New `is_virtual_intrinsic_supported()` is added to handle only virtual intrinsics`_hashCode` and `_clone`. 
I simply moved these intrinsics handling into separate method because it is used only by C2 in `library_call.cpp`. ------------- PR: https://git.openjdk.org/jdk/pull/12858 From kvn at openjdk.org Wed Mar 8 18:24:14 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 18:24:14 GMT Subject: RFR: 8302508: Add timestamp to the output TraceCompilerThreads In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 10:01:11 GMT, Tobias Hartmann wrote: >> Having timestamps added to the output of TraceCompilerThreads will be helpful in understanding how frequently the compiler threads are being added or removed. >> >> I did that and also added UL output. >> >> >> java -XX:+TraceCompilerThreads -XX:+PrintCompilation -version >> >> 86 Added initial compiler thread C2 CompilerThread0 >> 86 Added initial compiler thread C1 CompilerThread0 >> 92 1 3 java.lang.Object:: (1 bytes) >> 96 2 3 java.lang.String::coder (15 bytes) >> >> java -Xlog:jit+thread=debug -Xlog:jit+compilation=debug -version >> >> [0.078s][debug][jit,thread] Added initial compiler thread C2 CompilerThread0 >> [0.078s][debug][jit,thread] Added initial compiler thread C1 CompilerThread0 >> [0.083s][debug][jit,compilation] 1 3 java.lang.Object:: (1 bytes) >> [0.087s][debug][jit,compilation] 2 3 java.lang.String::coder (15 bytes) >> >> >> Tested tier1. > > Looks good to me. Thank you @TobiHartmann and @poonamparhar for reviews. ------------- PR: https://git.openjdk.org/jdk/pull/12898 From kvn at openjdk.org Wed Mar 8 18:29:30 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 18:29:30 GMT Subject: Integrated: 8302508: Add timestamp to the output TraceCompilerThreads In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 03:49:52 GMT, Vladimir Kozlov wrote: > Having timestamps added to the output of TraceCompilerThreads will be helpful in understanding how frequently the compiler threads are being added or removed. > > I did that and also added UL output. > > > java -XX:+TraceCompilerThreads -XX:+PrintCompilation -version > > 86 Added initial compiler thread C2 CompilerThread0 > 86 Added initial compiler thread C1 CompilerThread0 > 92 1 3 java.lang.Object:: (1 bytes) > 96 2 3 java.lang.String::coder (15 bytes) > > java -Xlog:jit+thread=debug -Xlog:jit+compilation=debug -version > > [0.078s][debug][jit,thread] Added initial compiler thread C2 CompilerThread0 > [0.078s][debug][jit,thread] Added initial compiler thread C1 CompilerThread0 > [0.083s][debug][jit,compilation] 1 3 java.lang.Object:: (1 bytes) > [0.087s][debug][jit,compilation] 2 3 java.lang.String::coder (15 bytes) > > > Tested tier1. This pull request has now been integrated. Changeset: f813dc71 Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/f813dc71836e002814622fead8a2b0464b49c83a Stats: 44 lines in 1 file changed: 28 ins; 0 del; 16 mod 8302508: Add timestamp to the output TraceCompilerThreads Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12898 From jsjolen at openjdk.org Wed Mar 8 18:38:33 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Wed, 8 Mar 2023 18:38:33 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v5] In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 17:50:23 GMT, Jesper Wilhelmsson wrote: >> Johan Sj?len has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 10 commits: >> >> - Merge remote-tracking branch 'origin/JDK-8301074' into JDK-8301074 >> - Explicitly use 0 for null in ARM interpreter >> - Merge remote-tracking branch 'origin/master' into JDK-8301074 >> - Remove trailing whitespace >> - Check for null string explicitly >> - vkozlov fixes >> - Manual review fixes >> - Fix >> - Fix compile errors >> - Replace NULL with nullptr in share/opto/ > > src/hotspot/share/opto/memnode.cpp line 271: > >> 269: st->print("alias_idx==%d, adr_check==", alias_idx); >> 270: if( adr_check == nullptr ) { >> 271: st->print("null"); > > Where are these strings printed? Is this a user detectable change? The tty, so yes. Previous changes to logging have been accepted ------------- PR: https://git.openjdk.org/jdk/pull/12187 From kvn at openjdk.org Wed Mar 8 18:43:30 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 18:43:30 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v2] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 14:52:01 GMT, Jorn Vernee wrote: >> The issue is that the size of the code buffer is not large enough to hold the whole stub. >> >> Proposed solution is to scale the size of the stub with the number of arguments. I've adjusted sizes for both downcall and upcall stubs. I've also dropped the number of relocations, since we're not really using any for downcalls, and for upcalls we only have 1 AFAICS. (the size of the relocations can not be zero however, as that leads to the relocation section [not being initialized][1], and triggering [an assert][2] later when the code blob is copied). >> >> The way I've determined the new base size and per-argument size for stubs, is by first linking a stub without any arguments to get the required base size, and by then adding 20 `double` arguments to get a rough per-argument size. Both values have wiggle room as well. The sizes can be printed using e.g. `-XX:+LogCompilation`, and then looking for `nep_invoker_blob` and `upcall_stub*` in the log file. This experiment was done on a fastdebug build to account for additional debug code being generated. The included test is designed to try and maximize the size of the generated stub. >> >> I've also updated `CodeBuffer::log_section_sizes` to print the in-use size, rather than just the capacity and free space. >> >> [1]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L119-L121 >> [2]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L675 > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > update copyright years I think you need to add these changes to other ports too and ask for testing on them. ------------- PR: https://git.openjdk.org/jdk/pull/12908 From jsjolen at openjdk.org Wed Mar 8 18:45:14 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Wed, 8 Mar 2023 18:45:14 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v6] In-Reply-To: References: Message-ID: <4WIyLmvrydonrPXxP9zmPrwLdSLLi-gGxFw3FxDN6PM=.01cb6b70-4b0d-4779-b5ec-35f1cf5594e7@github.com> > Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we > need to comb through these manually to make sure nothing has gone wrong. 
I also review these changes but things slip past my eyes sometimes. > > Here are some typical things to look out for: > > No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). > Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. > nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. > > An example of this: > > > // This function returns null > void* ret_null(); > // This function returns true if *x == nullptr > bool is_nullptr(void** x); > > > Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. > > Thanks! Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: Jesper's fixes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12187/files - new: https://git.openjdk.org/jdk/pull/12187/files/10902048..41bb3390 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12187&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12187&range=04-05 Stats: 37 lines in 15 files changed: 0 ins; 0 del; 37 mod Patch: https://git.openjdk.org/jdk/pull/12187.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12187/head:pull/12187 PR: https://git.openjdk.org/jdk/pull/12187 From kvn at openjdk.org Wed Mar 8 18:56:27 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 18:56:27 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v6] In-Reply-To: <4WIyLmvrydonrPXxP9zmPrwLdSLLi-gGxFw3FxDN6PM=.01cb6b70-4b0d-4779-b5ec-35f1cf5594e7@github.com> References: <4WIyLmvrydonrPXxP9zmPrwLdSLLi-gGxFw3FxDN6PM=.01cb6b70-4b0d-4779-b5ec-35f1cf5594e7@github.com> Message-ID: On Wed, 8 Mar 2023 18:45:14 GMT, Johan Sj?len wrote: >> Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we >> need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. >> >> Here are some typical things to look out for: >> >> No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). >> Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. >> nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. >> >> An example of this: >> >> >> // This function returns null >> void* ret_null(); >> // This function returns true if *x == nullptr >> bool is_nullptr(void** x); >> >> >> Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. >> >> Thanks! > > Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: > > Jesper's fixes Update is good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12187 From kvn at openjdk.org Wed Mar 8 18:56:30 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 18:56:30 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v5] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 18:35:36 GMT, Johan Sj?len wrote: >> src/hotspot/share/opto/memnode.cpp line 271: >> >>> 269: st->print("alias_idx==%d, adr_check==", alias_idx); >>> 270: if( adr_check == nullptr ) { >>> 271: st->print("null"); >> >> Where are these strings printed? Is this a user detectable change? > > The tty, so yes. Previous changes to logging have been accepted This code print additional information on `tty` before we fail the assert at line 277. It helps us with debugging. Note, the code is under (!consistent) check so it prints this only when something failed. ------------- PR: https://git.openjdk.org/jdk/pull/12187 From thartmann at openjdk.org Wed Mar 8 18:59:39 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 8 Mar 2023 18:59:39 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: Message-ID: <5tSsZG-Xkuz5KL0GfHdblkfBgLyi7jRHSeysdBS4Lp8=.17d07529-fa55-4b5d-ac70-7e485e470e85@github.com> The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Wed, 8 Mar 2023 18:17:50 GMT, Vladimir Kozlov wrote: >> src/hotspot/share/opto/c2compiler.hpp line 68: >> >>> 66: // Check if the compiler supports an intrinsic for 'method' given the >>> 67: // the dispatch mode specified by the 'is_virtual' parameter. >>> 68: bool is_virtual_intrinsic_supported(vmIntrinsics::ID id, bool is_virtual); >> >> I find the new name of the method confusing because it suggests that the intrinsic is always virtual. Can we keep the old name? > > We still have `is_intrinsic_supported()` declared at line 64. New `is_virtual_intrinsic_supported()` is added to handle only virtual intrinsics`_hashCode` and `_clone`. I simply moved these intrinsics handling into separate method because it is used only by C2 in `library_call.cpp`. Okay, got it. Thanks for the explanation. ------------- PR: https://git.openjdk.org/jdk/pull/12858 From thartmann at openjdk.org Wed Mar 8 19:15:05 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 8 Mar 2023 19:15:05 GMT Subject: RFR: 8303564: C2: "Bad graph detected in build_loop_late" after a CMove is wrongly split thru phi In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 10:22:55 GMT, Roland Westrelin wrote: > The following steps lead to the crash: > > - In `testHelper()`, the null and range checks for the `field1[0]` > load are hoisted out of the counted loop by loop predication > > - As a result, the `field1[0]` load is also out of loop, control > dependent on a predicate > > - pre/main/post loops are created, the main loop is unrolled, the `f` > value that's stored in `field3` is a Phi that merges the values out > of the 3 loops. > > - the `stop` variable that captures the limit of the loop is > transformed into a `Phi` that merges 1 and 2. > > - As a result, the Phi that's stored in `field3` now only merges the > value of the pre and post loop and is transformed into a `CmoveI` > that merges 2 values dependent on the `field1[0]` `LoadI` that's > control dependent on a predicate. 
> > - On the next round of loop opts, the `CmoveI` is assigned control > below the predicate but the `Bool`/`CmpI` for the `CmoveI` is > assigned control above, right below a `Region` that has a `Phi` that > is input to the `CmpI`. The reason is this logic: > https://github.com/rwestrel/jdk/blob/99f5687eb192b249a4a4533578f56b131fb8f234/src/hotspot/share/opto/loopnode.cpp#L5968 > > - The `CmoveI` is split thru phi because the `Bool`/`CmpI` have > control right below a `Region`. That shouldn't happen because the > `CmoveI` itself doesn't have control at the `Region` and is actually > pinned through the `LoadI` below the `Region`. > > The fix I propose is to check the control of the `CmoveI` before > proceding with split thru phi. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12851 From xliu at openjdk.org Wed Mar 8 19:27:12 2023 From: xliu at openjdk.org (Xin Liu) Date: Wed, 8 Mar 2023 19:27:12 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v4] In-Reply-To: References: <0gI6DIHtc7F63CFYoccotGQv-BHYadPRW0liqEQvh6Q=.58a774ec-7591-4f44-aa7f-7755593ac04e@github.com> Message-ID: On Wed, 8 Mar 2023 10:51:50 GMT, Tobias Hartmann wrote: >> Why don't we just use `C->_log2_node_notes_block_size` directly in (_useful.size() >> 8)? >> >> I don't understand why we have to add MAX2(8, new_size) either. It looks like c2 doesn't want to have node-level accuracy. It drops the lowest 8bits of node_idx as block_id. I think the minimal number of "block" is 1, or arr is NULL. > >> Why don't we just use C->_log2_node_notes_block_size directly in (_useful.size() >> 8)? > > Because it's private in `Compile`. We could make it public but I thought it's not worth it. > >> I don't understand why we have to add MAX2(8, new_size) either. It looks like c2 doesn't want to have node-level accuracy. It drops the lowest 8bits of node_idx as block_id. I think the minimal number of "block" is 1, or arr is NULL. > > I think you are misinterpreting the code in `Compile::locate_node_notes`. It first determines the `block_idx` by `idx >> _log2_node_notes_block_size` and then the position in that block by `idx & (_node_notes_block_size-1)`. Thanks for the clarification. yes, I understanding was wrong. A block is an array of 256 Node_Note. Is it a particular reason that Compile needs at least 8 blocks? It can grow automatically anyway. ------------- PR: https://git.openjdk.org/jdk/pull/12806 From vlivanov at openjdk.org Wed Mar 8 19:46:14 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 8 Mar 2023 19:46:14 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: References: Message-ID: <1WK1RTMCYwOXw7LvaJr04fL68nN64TpB05fcPC2a3uo=.d71b77a8-c814-4166-8f56-af97ac70dc55@github.com> On Wed, 8 Mar 2023 05:17:53 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. 
This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove RISC-V port code for float16 intrinsics Overall, looks good. Minor comments/suggestions follow. src/hotspot/share/opto/convertnode.cpp line 171: > 169: if (t == Type::TOP) return Type::TOP; > 170: if (t == Type::FLOAT) return TypeInt::SHORT; > 171: if (StubRoutines::f2hf() == nullptr) return bottom_type(); What's the purpose of this check? My understanding is ConvF2HF/ConvHF2F require intrinsification and on platforms where stubs are absent, intrinsification is disabled. src/hotspot/share/opto/convertnode.cpp line 244: > 242: > 243: const TypeInt *ti = t->is_int(); > 244: if (ti->is_con()) { I find it confusing that `ConvHF2FNode::Value()` has `is_con()` check, but `ConvF2HFNode::Value()`doesn't. I'd prefer to see both implementations unified. src/hotspot/share/runtime/sharedRuntime.cpp line 451: > 449: assert(StubRoutines::f2hf() != nullptr, "floatToFloat16 intrinsic is not supported on this platform"); > 450: typedef jshort (*f2hf_stub_t)(jfloat x); > 451: return ((f2hf_stub_t)StubRoutines::f2hf())(x); What's the point of keeping the wrappers around? The stubs can be called directly, can't they? ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Wed Mar 8 20:42:25 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 20:42:25 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v22] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 10:33:03 GMT, Emanuel Peter wrote: >> **List of important things below** >> >> - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 >> - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 >> - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord >> >> **Original RFE description:** >> Cyclic dependencies are not handled correctly in all cases. Three examples: >> >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 >> >> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 >> >> And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 >> >> All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. 
>> >> **Analysis** >> >> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: >> >> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. >> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). >> >> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. >> >> I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. >> **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! >> >> **Solution** >> >> First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. >> >> I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. >> >> Another change I have made: >> Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. >> >> **Testing** >> >> I added a few more regression tests, and am running tier1-3, plus some stress testing. >> >> However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. >> **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. 
>> >> **Discussion / Future Work** >> >> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > TestDependencyOffsets.java: add vanilla run Looks reasonable. The one thing I don't understand is new method `find_dependence()`. May be because I don't know what data `DepPreds` is operating on. Do I understand it correctly?: 1. All nodes in one pack are independent 2. Using `DepPreds` looks through all inputs for each node in pack and put them on work list if they are in the same block and in the same depth range. Does not matter if they are in an other pack or not? 3. Go through these inputs and put their inputs on work list if they satisfy conditions. 4. If we find input which is a node in the pack - we got dependence, return this pack's node. 5. We check an input only once because we use Unique_Node_List. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From jvernee at openjdk.org Wed Mar 8 20:57:16 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Wed, 8 Mar 2023 20:57:16 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v2] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 18:40:05 GMT, Vladimir Kozlov wrote: >> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: >> >> update copyright years > > I think you need to add these changes to other ports too and ask for testing on them. @vnkozlov Yes, this is true. The only other existing port of this code is RISCV. However, to fix that port properly, someone needs to repeat the experiment on RISCV in order to figure out what the base size and the size per argument should be. I don't have access to a RISCV machine, so I figured I would file a followup issue for the RISCV maintainers to fix separately. @feilongjiang Could you comment on this? If you could figure out the needed sizes for RISCV I could add the needed changes to this patch. Otherwise I could file a followup issue if that seems more convenient. ------------- PR: https://git.openjdk.org/jdk/pull/12908 From kvn at openjdk.org Wed Mar 8 21:14:19 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 21:14:19 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: References: Message-ID: <5OfHFOlt05wJhzyU1Rpce3xx3B4BnO-dEF6X0I1JgD8=.1135fdd9-ae96-45d9-a26c-ad22f721bad7@github.com> On Wed, 8 Mar 2023 05:17:53 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. 
This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove RISC-V port code for float16 intrinsics Thank you for review @iwanowww ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Wed Mar 8 21:14:23 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 21:14:23 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: <1WK1RTMCYwOXw7LvaJr04fL68nN64TpB05fcPC2a3uo=.d71b77a8-c814-4166-8f56-af97ac70dc55@github.com> References: <1WK1RTMCYwOXw7LvaJr04fL68nN64TpB05fcPC2a3uo=.d71b77a8-c814-4166-8f56-af97ac70dc55@github.com> Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Wed, 8 Mar 2023 19:38:56 GMT, Vladimir Ivanov wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove RISC-V port code for float16 intrinsics > > src/hotspot/share/opto/convertnode.cpp line 171: > >> 169: if (t == Type::TOP) return Type::TOP; >> 170: if (t == Type::FLOAT) return TypeInt::SHORT; >> 171: if (StubRoutines::f2hf() == nullptr) return bottom_type(); > > What's the purpose of this check? My understanding is ConvF2HF/ConvHF2F require intrinsification and on platforms where stubs are absent, intrinsification is disabled. This code is optimization: use stub to calculate constant value during compilation instead of generating HW instruction in compiled code. It is not required to have this stub for intensification to work - `ConvF2HFNode` will be processed normally and will use intrinsics code (HW instruction) defined in .ad file. These stubs are used only here, not in C1 and not in Interpreter. As consequence these stubs implementation is optional and I implemented them only on x64. That is why I have this check. I debated to not have them at all to not confuse people but they did improve performance a little. > src/hotspot/share/opto/convertnode.cpp line 244: > >> 242: >> 243: const TypeInt *ti = t->is_int(); >> 244: if (ti->is_con()) { > > I find it confusing that `ConvHF2FNode::Value()` has `is_con()` check, but `ConvF2HFNode::Value()`doesn't. I'd prefer to see both implementations unified. It follows the same pattern as other nodes here: `ConvF2INode::Value()` vs `ConvI2FNode::Value()`. If you want to change it we need to do that in separate RFE for all methods here. But I don't think we need to do that because Float/Double does not have range values as Integer types. Float have only 3 types of value: FloatTop, FloatBot, FloatCon. So we don't need to check for constant if checked for TOP and BOT. For Integer we need to check `bool is_con() const { return _lo==_hi; }`. 
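For readers following along, the constant-folding shape being discussed looks roughly like the sketch below. It is a simplified illustration pieced together from the `convertnode.cpp` lines 169-171 and the `f2hf_stub_t` cast quoted elsewhere in this thread; the final `TypeInt::make(...)` call and the way the stub is invoked are assumptions for illustration, not the exact code in the patch.

    // Sketch: fold a constant float input into a constant half-float (short)
    // using the runtime stub, when the stub exists on this platform.
    const Type* ConvF2HFNode::Value(PhaseGVN* phase) const {
      const Type* t = phase->type(in(1));
      if (t == Type::TOP)   return Type::TOP;       // input type not computed yet
      if (t == Type::FLOAT) return TypeInt::SHORT;  // not a constant, just "some float"
      if (StubRoutines::f2hf() == nullptr) {
        return bottom_type();                       // no stub: skip constant folding
      }
      // Only FloatCon is left, so no explicit is_con() check is needed for floats.
      typedef jshort (*f2hf_stub_t)(jfloat x);
      jshort h = ((f2hf_stub_t)StubRoutines::f2hf())(t->getf());
      return TypeInt::make(h);
    }

The integer-input case in `ConvHF2FNode::Value()` needs the extra `ti->is_con()` check because a `TypeInt` can describe a whole range (`_lo != _hi`), whereas a float type that is neither TOP nor FLOAT can only be a single constant.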
------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Wed Mar 8 21:22:25 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 21:22:25 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: <1WK1RTMCYwOXw7LvaJr04fL68nN64TpB05fcPC2a3uo=.d71b77a8-c814-4166-8f56-af97ac70dc55@github.com> References: <1WK1RTMCYwOXw7LvaJr04fL68nN64TpB05fcPC2a3uo=.d71b77a8-c814-4166-8f56-af97ac70dc55@github.com> Message-ID: <53sy823UIcXi4OTG8SgGnjrUmkOymF2L62BHoFkTgPk=.5de6fcd3-ea94-4f6a-8c1b-2ccf243374fd@github.com> On Wed, 8 Mar 2023 19:04:01 GMT, Vladimir Ivanov wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove RISC-V port code for float16 intrinsics > > src/hotspot/share/runtime/sharedRuntime.cpp line 451: > >> 449: assert(StubRoutines::f2hf() != nullptr, "floatToFloat16 intrinsic is not supported on this platform"); >> 450: typedef jshort (*f2hf_stub_t)(jfloat x); >> 451: return ((f2hf_stub_t)StubRoutines::f2hf())(x); > > What's the point of keeping the wrappers around? The stubs can be called directly, can't they? I wanted isolate function type cast and assert in one place. BTW the comment in assert should be "the stub is not implemented on this platform". ------------- PR: https://git.openjdk.org/jdk/pull/12869 From vlivanov at openjdk.org Wed Mar 8 21:44:18 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 8 Mar 2023 21:44:18 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: References: <1WK1RTMCYwOXw7LvaJr04fL68nN64TpB05fcPC2a3uo=.d71b77a8-c814-4166-8f56-af97ac70dc55@github.com> Message-ID: On Wed, 8 Mar 2023 20:55:29 GMT, Vladimir Kozlov wrote: >> src/hotspot/share/opto/convertnode.cpp line 171: >> >>> 169: if (t == Type::TOP) return Type::TOP; >>> 170: if (t == Type::FLOAT) return TypeInt::SHORT; >>> 171: if (StubRoutines::f2hf() == nullptr) return bottom_type(); >> >> What's the purpose of this check? My understanding is ConvF2HF/ConvHF2F require intrinsification and on platforms where stubs are absent, intrinsification is disabled. > > This code is optimization: use stub to calculate constant value during compilation instead of generating HW instruction in compiled code. It is not required to have this stub for intensification to work - `ConvF2HFNode` will be processed normally and will use intrinsics code (HW instruction) defined in .ad file. > These stubs are used only here, not in C1 and not in Interpreter. As consequence these stubs implementation is optional and I implemented them only on x64. That is why I have this check. > I debated to not have them at all to not confuse people but they did improve performance a little. Thanks for the clarifications. Now it makes much more sense. Still, the mix of `StubRoutines::f2hf()` and `SharedRuntime::f2hf()` looks a bit confusing. What if you move the wrapper to `StubRoutines` class instead? (`JRT_LEAF` et al stuff looks redundant here. Also, even though there are other arithmetic operations declared on `StubRoutines`, they provide default implementations universally available across all platforms. `f2hf` case is different since it exposes a platform-specific stub and its availability is limited.) 
Or encapsulate the constant folding logic (along with the guard) into `SharedRuntime` and return `Type*` (instead of int/float scalar). ------------- PR: https://git.openjdk.org/jdk/pull/12869 From vlivanov at openjdk.org Wed Mar 8 21:50:17 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 8 Mar 2023 21:50:17 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: References: <1WK1RTMCYwOXw7LvaJr04fL68nN64TpB05fcPC2a3uo=.d71b77a8-c814-4166-8f56-af97ac70dc55@github.com> Message-ID: <-P79Z2uDRHMZk2XTPNY9MidCv-H4t9Cj2Rot1aEj0Cg=.cc46bdde-0e11-4252-b2d7-96ee906c7f04@github.com> On Wed, 8 Mar 2023 21:41:31 GMT, Vladimir Ivanov wrote: > Or encapsulate the constant folding logic (along with the guard) into SharedRuntime and return Type* (instead of int/float scalar). I take this particular suggestion back. `SharedRuntime` is compiler-agnostic while `Type` is C2-specific. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From dlong at openjdk.org Wed Mar 8 22:30:19 2023 From: dlong at openjdk.org (Dean Long) Date: Wed, 8 Mar 2023 22:30:19 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 05:17:53 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove RISC-V port code for float16 intrinsics src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 3534: > 3532: __ leave(); // required for proper stackwalking of RuntimeStub frame > 3533: __ ret(0); > 3534: Do we really need to set up a stack frame for these two? This should be a leaf, and we have other leaf stubs that don't set up a frame. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From dnsimon at openjdk.org Wed Mar 8 22:37:26 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 8 Mar 2023 22:37:26 GMT Subject: RFR: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted [v4] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 10:52:40 GMT, Andrew Dinn wrote: >> Doug Simon has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: >> >> - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8299570 >> - changed return type of NativeCall::trampoline_jump to void >> - try rationalize code in NativeCall::trampoline_jump >> - properly handle CodeBuffer exhaustion in JVMCI backend > > @dougxc I'm assuming the test failures are unrelated? If so then ok to push. Thanks for the reviews @adinn @theRealAph and @tkrodriguez . ------------- PR: https://git.openjdk.org/jdk/pull/11945 From dnsimon at openjdk.org Wed Mar 8 22:37:28 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 8 Mar 2023 22:37:28 GMT Subject: Integrated: 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted In-Reply-To: References: Message-ID: On Wed, 11 Jan 2023 14:29:43 GMT, Doug Simon wrote: > This PR fixes the handling of a full CodeBuffer when emitting stubs as part of JVMCI code installation. This pull request has now been integrated. Changeset: ad326fc6 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/ad326fc62be9fa29438fb4b59a51c38dd94afd68 Stats: 45 lines in 5 files changed: 27 ins; 10 del; 8 mod 8299570: [JVMCI] Insufficient error handling when CodeBuffer is exhausted Reviewed-by: never, adinn, aph ------------- PR: https://git.openjdk.org/jdk/pull/11945 From kvn at openjdk.org Wed Mar 8 22:54:17 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 22:54:17 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: References: Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Wed, 8 Mar 2023 05:17:53 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove RISC-V port code for float16 intrinsics > What if you move the wrapper to StubRoutines class instead? Okay, I will try it. 
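For illustration, one possible shape of such a `StubRoutines`-level wrapper, reusing the cast-and-assert pattern from the `sharedRuntime.cpp` snippet quoted earlier in this thread. This is only a sketch of the suggestion, not the change that was eventually pushed, and the stub entry field name `_f2hf` is an assumption.

    // stubRoutines.hpp (sketch): keep the function-pointer cast and the assert
    // next to the stub entry itself instead of in SharedRuntime.
    static jshort f2hf(jfloat x) {
      assert(_f2hf != nullptr, "stub is not implemented on this platform");
      typedef jshort (*f2hf_stub_t)(jfloat x);
      return ((f2hf_stub_t)_f2hf)(x);   // _f2hf: assumed name of the raw stub entry
    }

Callers such as `convertnode.cpp` could then keep checking the raw entry for nullptr and call the typed wrapper for the actual conversion, so no `JRT_LEAF` machinery is involved.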
------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Wed Mar 8 22:54:19 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 22:54:19 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: References: Message-ID: <-bRTiJ15e_f536NOqVIIge5Q_Ij1YwZVtS1Ciew-XDY=.9cbea18b-3202-4b76-a4aa-618005d4677b@github.com> On Wed, 8 Mar 2023 22:27:31 GMT, Dean Long wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove RISC-V port code for float16 intrinsics > > src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 3534: > >> 3532: __ leave(); // required for proper stackwalking of RuntimeStub frame >> 3533: __ ret(0); >> 3534: > > Do we really need to set up a stack frame for these two? This should be a leaf, and we have other leaf stubs that don't set up a frame. I think you are right. These stubs are not called from compiled code, only from C++ (C2) code during compilation. Let me test it. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Wed Mar 8 23:14:05 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 8 Mar 2023 23:14:05 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v5] In-Reply-To: References: Message-ID: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. 
> > Tested tier1-5, Xcomp, stress Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Remove SharedRuntime::f2hf and hf2f wrapper functions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12869/files - new: https://git.openjdk.org/jdk/pull/12869/files/9f4b2474..fa799942 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12869&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12869&range=03-04 Stats: 43 lines in 5 files changed: 15 ins; 20 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/12869.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12869/head:pull/12869 PR: https://git.openjdk.org/jdk/pull/12869 From vlivanov at openjdk.org Wed Mar 8 23:40:09 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 8 Mar 2023 23:40:09 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v5] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 23:14:05 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove SharedRuntime::f2hf and hf2f wrapper functions Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Thu Mar 9 00:04:09 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 00:04:09 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v4] In-Reply-To: <-bRTiJ15e_f536NOqVIIge5Q_Ij1YwZVtS1Ciew-XDY=.9cbea18b-3202-4b76-a4aa-618005d4677b@github.com> References: <-bRTiJ15e_f536NOqVIIge5Q_Ij1YwZVtS1Ciew-XDY=.9cbea18b-3202-4b76-a4aa-618005d4677b@github.com> Message-ID: <3vTFng8nGQ-gFMkTzyexYknU4McM_8mzY5b6699QFaE=.ccd9b65f-fb69-49a4-b174-1847cdee3ebb@github.com> On Wed, 8 Mar 2023 22:49:42 GMT, Vladimir Kozlov wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 3534: >> >>> 3532: __ leave(); // required for proper stackwalking of RuntimeStub frame >>> 3533: __ ret(0); >>> 3534: >> >> Do we really need to set up a stack frame for these two? This should be a leaf, and we have other leaf stubs that don't set up a frame. > > I think you are right. These stubs are not called from compiled code, only from C++ (C2) code during compilation. > Let me test it. 
Testing passed with `enter()` and `leave()` removed ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Thu Mar 9 00:09:10 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 00:09:10 GMT Subject: RFR: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter [v5] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 23:14:05 GMT, Vladimir Kozlov wrote: >> Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. >> >> Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. >> >> I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. >> >> Tested tier1-5, Xcomp, stress > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove SharedRuntime::f2hf and hf2f wrapper functions Thank you, Vladimir and Dean for review. ------------- PR: https://git.openjdk.org/jdk/pull/12869 From wanghaomin at openjdk.org Thu Mar 9 01:19:40 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Thu, 9 Mar 2023 01:19:40 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> Message-ID: > After https://bugs.openjdk.org/browse/JDK-8292289 , the base class of VectorTestNode changed from Node to CmpNode. So I add two match rule into ad file. > > match(If cop (VectorTest op1 op2)); > match(Set dst (CMoveI (Binary cop (VectorTest op1 op2)) (Binary src1 src2))); > > First error, rule1 shouldn't generate the statement "node->_bottom_type = _leaf->bottom_type();". > Second error, both rule1 and rule2 need to use VectorTestNode, the VectorTestNode should be cloned like CmpNode. 
Wang Haomin has updated the pull request incrementally with one additional commit since the last revision: compare the results with 0 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12917/files - new: https://git.openjdk.org/jdk/pull/12917/files/bfba75dd..564d0c48 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12917&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12917&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12917.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12917/head:pull/12917 PR: https://git.openjdk.org/jdk/pull/12917 From wanghaomin at openjdk.org Thu Mar 9 01:22:14 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Thu, 9 Mar 2023 01:22:14 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: <9x4lN-lTSZwSFUhIsxtq2uDhy-9piQ6aoDHF94QLs7w=.c03c138e-e83f-483a-9e28-365b4052898a@github.com> References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> <9x4lN-lTSZwSFUhIsxtq2uDhy-9piQ6aoDHF94QLs7w=.c03c138e-e83f-483a-9e28-365b4052898a@github.com> Message-ID: <8sCKzXtE6VXzjnz_Vy7SPufcwBctZZw3-r3bYIikm5U=.12418a81-e370-4294-a4c6-212279900f40@github.com> On Wed, 8 Mar 2023 14:40:35 GMT, Quan Anh Mai wrote: >> Wang Haomin has updated the pull request incrementally with one additional commit since the last revision: >> >> compare the results with 0 > > src/hotspot/share/adlc/output_c.cpp line 3989: > >> 3987: if (inst->captures_bottom_type(_globalNames)) { >> 3988: if (strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall")) >> 3989: && strncmp("MachIf", inst->mach_base_class(_globalNames), strlen("MachIf"))) { > > Maybe explicitly compare the results with 0 here for consistency with other usage of `strcmp` Thanks for your review, DONE ------------- PR: https://git.openjdk.org/jdk/pull/12917 From qamai at openjdk.org Thu Mar 9 02:34:07 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 9 Mar 2023 02:34:07 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: <8sCKzXtE6VXzjnz_Vy7SPufcwBctZZw3-r3bYIikm5U=.12418a81-e370-4294-a4c6-212279900f40@github.com> References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> <9x4lN-lTSZwSFUhIsxtq2uDhy-9piQ6aoDHF94QLs7w=.c03c138e-e83f-483a-9e28-365b4052898a@github.com> <8sCKzXtE6VXzjnz_Vy7SPufcwBctZZw3-r3bYIikm5U=.12418a81-e370-4294-a4c6-212279900f40@github.com> Message-ID: <1tvFtmaxsxXGiVLLbe6urnvdDJyjYjfG_rO-orqjppE=.bdede5ad-0fec-4eb2-94df-60214d3179f2@github.com> The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Thu, 9 Mar 2023 01:19:43 GMT, Wang Haomin wrote: >> src/hotspot/share/adlc/output_c.cpp line 3989: >> >>> 3987: if (inst->captures_bottom_type(_globalNames)) { >>> 3988: if (strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall")) >>> 3989: && strncmp("MachIf", inst->mach_base_class(_globalNames), strlen("MachIf"))) { >> >> Maybe explicitly compare the results with 0 here for consistency with other usage of `strcmp` > > Thanks for your review, DONE Rethink about it should it be `strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall") + 1)` instead? 
Or else we could return true for nodes that have `MachCall` as a prefix. ------------- PR: https://git.openjdk.org/jdk/pull/12917 From dholmes at openjdk.org Thu Mar 9 02:37:07 2023 From: dholmes at openjdk.org (David Holmes) Date: Thu, 9 Mar 2023 02:37:07 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: Message-ID: <14u458Y0fDM8iouM3LGbvnQZKw5FzRoVmgKchzOUpsk=.1abcb17b-53dd-4524-9942-be9e3ec6a358@github.com> The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Fri, 3 Mar 2023 16:16:08 GMT, Vladimir Kozlov wrote: > Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. > We have *product* VM flags for most intrinsics and set them in VM based on HW support. > But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. > Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. > > I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). > > Additional fixes: > Fixed Interpreter to skip intrinsics if they are disabled with flag. > Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. > Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. > Added missing `native` mark to `_currentThread`. > Removed unused `AbstractInterpreter::in_native_entry()`. > Cleanup C2 intrinsic checks code. > > Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. Just a couple of drive-by comments. I'm a bit confused about the core issue of " Add VM_Version::is_intrinsic_supported(id)" because AFAICS this was only added for x86 ?? src/hotspot/cpu/x86/vm_version_x86.cpp line 3230: > 3228: case vmIntrinsics::_floatToFloat16: > 3229: case vmIntrinsics::_float16ToFloat: > 3230: if (!supports_f16c() && !supports_avx512vl()) return false; Please put the return on a new line - it is too easy to just see the break. ------------- PR: https://git.openjdk.org/jdk/pull/12858 From dholmes at openjdk.org Thu Mar 9 02:37:08 2023 From: dholmes at openjdk.org (David Holmes) Date: Thu, 9 Mar 2023 02:37:08 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: <5tSsZG-Xkuz5KL0GfHdblkfBgLyi7jRHSeysdBS4Lp8=.17d07529-fa55-4b5d-ac70-7e485e470e85@github.com> References: <5tSsZG-Xkuz5KL0GfHdblkfBgLyi7jRHSeysdBS4Lp8=.17d07529-fa55-4b5d-ac70-7e485e470e85@github.com> Message-ID: On Wed, 8 Mar 2023 18:56:55 GMT, Tobias Hartmann wrote: >> We still have `is_intrinsic_supported()` declared at line 64. New `is_virtual_intrinsic_supported()` is added to handle only virtual intrinsics`_hashCode` and `_clone`. I simply moved these intrinsics handling into separate method because it is used only by C2 in `library_call.cpp`. > > Okay, got it. Thanks for the explanation. Okay but then why do we still need the parameter `is_virtual`? 
------------- PR: https://git.openjdk.org/jdk/pull/12858 From wanghaomin at openjdk.org Thu Mar 9 03:17:14 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Thu, 9 Mar 2023 03:17:14 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: <1tvFtmaxsxXGiVLLbe6urnvdDJyjYjfG_rO-orqjppE=.bdede5ad-0fec-4eb2-94df-60214d3179f2@github.com> References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> <9x4lN-lTSZwSFUhIsxtq2uDhy-9piQ6aoDHF94QLs7w=.c03c138e-e83f-483a-9e28-365b4052898a@github.com> <8sCKzXtE6VXzjnz_Vy7SPufcwBctZZw3-r3bYIikm5U=.12418a81-e370-4294-a4c6-212279900f40@github.com> <1tvFtmaxsxXGiVLLbe6urnvdDJyjYjfG_rO-orqjppE=.bdede5ad-0fec-4eb2-94df-60214d3179f2@github.com> Message-ID: On Thu, 9 Mar 2023 02:30:51 GMT, Quan Anh Mai wrote: >> Thanks for your review, DONE > > Rethink about it should it be `strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall") + 1)` instead? Or else we could return true for nodes that have `MachCall` as a prefix. For example, `strncmp("MachCall", "MachCallStaticJavaNode", strlen("MachCall"))` , it just compare first 8 characters, the result of this `strncmp` is 0. This is the expected result, I don't understand why `strlen` +1. ------------- PR: https://git.openjdk.org/jdk/pull/12917 From kvn at openjdk.org Thu Mar 9 03:30:30 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 03:30:30 GMT Subject: Integrated: 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 21:41:35 GMT, Vladimir Kozlov wrote: > Implemented `Float.floatToFloat16` and `Float.float16ToFloat` intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally. > > Replaced `SharedRuntime::f2hf()` and `hf2f()` C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped. > > I added new tests based on Tobias's `TestAll.java` And copied `jdk/lang/Float/Binary16Conversion*.java` tests to run them with `-Xcomp` to make sure code is compiled by C1 or C2. I modified `Binary16ConversionNaN.java` to compare results from Interpreter, C1 and C2. > > Tested tier1-5, Xcomp, stress This pull request has now been integrated. 
Changeset: 8cfd74f7 Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/8cfd74f76afc9e5d50c52104fef9974784718dd4 Stats: 1408 lines in 47 files changed: 1213 ins; 154 del; 41 mod 8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter Reviewed-by: sviswanathan, jbhateja, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/12869 From kvn at openjdk.org Thu Mar 9 03:32:15 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 03:32:15 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: <14u458Y0fDM8iouM3LGbvnQZKw5FzRoVmgKchzOUpsk=.1abcb17b-53dd-4524-9942-be9e3ec6a358@github.com> References: <14u458Y0fDM8iouM3LGbvnQZKw5FzRoVmgKchzOUpsk=.1abcb17b-53dd-4524-9942-be9e3ec6a358@github.com> Message-ID: On Thu, 9 Mar 2023 02:19:44 GMT, David Holmes wrote: >> Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. >> We have *product* VM flags for most intrinsics and set them in VM based on HW support. >> But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. >> Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. >> >> I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). >> >> Additional fixes: >> Fixed Interpreter to skip intrinsics if they are disabled with flag. >> Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. >> Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. >> Added missing `native` mark to `_currentThread`. >> Removed unused `AbstractInterpreter::in_native_entry()`. >> Cleanup C2 intrinsic checks code. >> >> Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. > > src/hotspot/cpu/x86/vm_version_x86.cpp line 3230: > >> 3228: case vmIntrinsics::_floatToFloat16: >> 3229: case vmIntrinsics::_float16ToFloat: >> 3230: if (!supports_f16c() && !supports_avx512vl()) return false; > > Please put the return on a new line - it is too easy to just see the break. Okay. ------------- PR: https://git.openjdk.org/jdk/pull/12858 From kvn at openjdk.org Thu Mar 9 04:02:12 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 04:02:12 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: <5tSsZG-Xkuz5KL0GfHdblkfBgLyi7jRHSeysdBS4Lp8=.17d07529-fa55-4b5d-ac70-7e485e470e85@github.com> Message-ID: On Thu, 9 Mar 2023 02:28:44 GMT, David Holmes wrote: >> Okay, got it. Thanks for the explanation. > > Okay but then why do we still need the parameter `is_virtual`? @dholmes-ora I can remove parameter if I modify caller code like next: - is_available = compiler != NULL && compiler->is_intrinsic_available(mh, C->directive()) && - compiler->is_virtual_intrinsic_supported(id, is_virtual); + is_available = compiler != NULL && compiler->is_intrinsic_available(mh, C->directive()); + if (is_available && is_virtual) { + is_available = compiler->is_virtual_intrinsic_supported(id); + } Will it satisfy you? 
------------- PR: https://git.openjdk.org/jdk/pull/12858 From dholmes at openjdk.org Thu Mar 9 04:35:07 2023 From: dholmes at openjdk.org (David Holmes) Date: Thu, 9 Mar 2023 04:35:07 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: <5tSsZG-Xkuz5KL0GfHdblkfBgLyi7jRHSeysdBS4Lp8=.17d07529-fa55-4b5d-ac70-7e485e470e85@github.com> Message-ID: On Thu, 9 Mar 2023 03:59:35 GMT, Vladimir Kozlov wrote: >> Okay but then why do we still need the parameter `is_virtual`? > > @dholmes-ora I can remove parameter if I modify caller code like next: > > - is_available = compiler != NULL && compiler->is_intrinsic_available(mh, C->directive()) && > - compiler->is_virtual_intrinsic_supported(id, is_virtual); > + is_available = compiler != NULL && compiler->is_intrinsic_available(mh, C->directive()); > + if (is_available && is_virtual) { > + is_available = compiler->is_virtual_intrinsic_supported(id); > + } > > Will it satisfy you? How many callers are there? From an API design perspective this method is either only for virtual intrinsics, so no parameter needed, or it is for general intrinsics and the parameter indicates what type. ------------- PR: https://git.openjdk.org/jdk/pull/12858 From qamai at openjdk.org Thu Mar 9 04:37:14 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 9 Mar 2023 04:37:14 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> <9x4lN-lTSZwSFUhIsxtq2uDhy-9piQ6aoDHF94QLs7w=.c03c138e-e83f-483a-9e28-365b4052898a@github.com> <8sCKzXtE6VXzjnz_Vy7SPufcwBctZZw3-r3bYIikm5U=.12418a81-e370-4294-a4c6-212279900f40@github.com> <1tvFtmaxsxXGiVLLbe6urnvdDJyjYjfG_rO-orqjppE=.bdede5ad-0fec-4eb2-94df-60214d3179f2@github.com> Message-ID: On Thu, 9 Mar 2023 03:14:20 GMT, Wang Haomin wrote: >> Rethink about it should it be `strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall") + 1)` instead? Or else we could return true for nodes that have `MachCall` as a prefix. > > For example, `strncmp("MachCall", "MachCallStaticJavaNode", strlen("MachCall"))` , it just compare first 8 characters, the result of this `strncmp` is 0. This is the expected result, I don't understand why `strlen` +1. @haominw Ah I understand, is this behaviour similarly expected for `MachIf`, or we want the node to exactly match here? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/12917 From wanghaomin at openjdk.org Thu Mar 9 04:59:04 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Thu, 9 Mar 2023 04:59:04 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> <9x4lN-lTSZwSFUhIsxtq2uDhy-9piQ6aoDHF94QLs7w=.c03c138e-e83f-483a-9e28-365b4052898a@github.com> <8sCKzXtE6VXzjnz_Vy7SPufcwBctZZw3-r3bYIikm5U=.12418a81-e370-4294-a4c6-212279900f40@github.com> <1tvFtmaxsxXGiVLLbe6urnvdDJyjYjfG_rO-orqjppE=.bdede5ad-0fec-4eb2-94df-60214d3179f2@github.com> Message-ID: <2st223YSbG9FzwhezqigbVClARllwU4M3-sUDlW1ziI=.db911de6-67d9-462b-902c-da9067c50094@github.com> On Thu, 9 Mar 2023 04:34:38 GMT, Quan Anh Mai wrote: >> For example, `strncmp("MachCall", "MachCallStaticJavaNode", strlen("MachCall"))` , it just compare first 8 characters, the result of this `strncmp` is 0. This is the expected result, I don't understand why `strlen` +1. 
> > @haominw Ah I understand, is this behaviour similarly expected for `MachIf`, or we want the node to exactly match here? Thanks. Just want to match MachIfNode. Because the output of `inst->mach_base_class(_globalNames)` is only `MachIfNode`, no other `MachIf`, so I just match `MachIf`. ------------- PR: https://git.openjdk.org/jdk/pull/12917 From kvn at openjdk.org Thu Mar 9 05:06:15 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 05:06:15 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: <5tSsZG-Xkuz5KL0GfHdblkfBgLyi7jRHSeysdBS4Lp8=.17d07529-fa55-4b5d-ac70-7e485e470e85@github.com> Message-ID: On Thu, 9 Mar 2023 04:32:06 GMT, David Holmes wrote: >> @dholmes-ora I can remove parameter if I modify caller code like next: >> >> - is_available = compiler != NULL && compiler->is_intrinsic_available(mh, C->directive()) && >> - compiler->is_virtual_intrinsic_supported(id, is_virtual); >> + is_available = compiler != NULL && compiler->is_intrinsic_available(mh, C->directive()); >> + if (is_available && is_virtual) { >> + is_available = compiler->is_virtual_intrinsic_supported(id); >> + } >> >> Will it satisfy you? > > How many callers are there? From an API design perspective this method is either only for virtual intrinsics, so no parameter needed, or it is for general intrinsics and the parameter indicates what type. This is the only place where it is called. ------------- PR: https://git.openjdk.org/jdk/pull/12858 From kvn at openjdk.org Thu Mar 9 05:24:07 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 05:24:07 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: <5tSsZG-Xkuz5KL0GfHdblkfBgLyi7jRHSeysdBS4Lp8=.17d07529-fa55-4b5d-ac70-7e485e470e85@github.com> Message-ID: On Thu, 9 Mar 2023 05:03:27 GMT, Vladimir Kozlov wrote: >> How many callers are there? From an API design perspective this method is either only for virtual intrinsics, so no parameter needed, or it is for general intrinsics and the parameter indicates what type. > > This is the only place where it is called. Actually I found that I don't need `is_virtual_intrinsic_supported()` because the same information provides existing [vmIntrinsics::does_virtual_dispatch(id)](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/classfile/vmIntrinsics.cpp#L173) ------------- PR: https://git.openjdk.org/jdk/pull/12858 From kvn at openjdk.org Thu Mar 9 05:31:52 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 05:31:52 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v2] In-Reply-To: References: Message-ID: > Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. > We have *product* VM flags for most intrinsics and set them in VM based on HW support. > But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. > Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. > > I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). > > Additional fixes: > Fixed Interpreter to skip intrinsics if they are disabled with flag. 
> Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. > Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. > Added missing `native` mark to `_currentThread`. > Removed unused `AbstractInterpreter::in_native_entry()`. > Cleanup C2 intrinsic checks code. > > Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Address comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12858/files - new: https://git.openjdk.org/jdk/pull/12858/files/43c0056b..2d6ff556 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12858&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12858&range=00-01 Stats: 28 lines in 4 files changed: 4 ins; 21 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/12858.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12858/head:pull/12858 PR: https://git.openjdk.org/jdk/pull/12858 From kvn at openjdk.org Thu Mar 9 05:43:06 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 05:43:06 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v2] In-Reply-To: <14u458Y0fDM8iouM3LGbvnQZKw5FzRoVmgKchzOUpsk=.1abcb17b-53dd-4524-9942-be9e3ec6a358@github.com> References: <14u458Y0fDM8iouM3LGbvnQZKw5FzRoVmgKchzOUpsk=.1abcb17b-53dd-4524-9942-be9e3ec6a358@github.com> Message-ID: On Thu, 9 Mar 2023 02:34:34 GMT, David Holmes wrote: > I'm a bit confused about the core issue of "Add VM_Version::is_intrinsic_supported(id)" because AFAICS this was only added for x86 ?? Currently the only way to check in shared code if a platform supports intrinsic is to add new flag which is set in platform specific code based on CPU instructions set. And we have "ton" of such global flags already: [globals.hpp#L320](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/globals.hpp#L320) I don't want to add new flag for each new intrinsic. I propose to add new VM_Version method `is_intrinsic_supported(id)` which can be called from shared code. Currently only new `float16ToFloat` intrinsic does not corresponding flag for which I can use this new API. The intrinsic is implemented only on Aarch64 (unconditionally) and on x86 (if it has corresponding AVX512 or F16C instructions). Because of that currently we need to check `float16ToFloat` support only for x86. ------------- PR: https://git.openjdk.org/jdk/pull/12858 From kvn at openjdk.org Thu Mar 9 05:56:16 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 05:56:16 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v2] In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 05:31:52 GMT, Vladimir Kozlov wrote: >> Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. >> We have *product* VM flags for most intrinsics and set them in VM based on HW support. >> But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. >> Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. >> >> I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). 
>> >> Additional fixes: >> Fixed Interpreter to skip intrinsics if they are disabled with flag. >> Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. >> Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. >> Added missing `native` mark to `_currentThread`. >> Removed unused `AbstractInterpreter::in_native_entry()`. >> Cleanup C2 intrinsic checks code. >> >> Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address comments An other approach, which I implemented in [JDK-8302976](https://git.openjdk.org/jdk/pull/12869), is to add intrinsic specific `VM_Version` method: [vm_version_x86.hpp#L762](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/vm_version_x86.hpp#L762) Then you have the same issue as with flags. You will have multiply such methods in a future. ------------- PR: https://git.openjdk.org/jdk/pull/12858 From dholmes at openjdk.org Thu Mar 9 06:26:15 2023 From: dholmes at openjdk.org (David Holmes) Date: Thu, 9 Mar 2023 06:26:15 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v2] In-Reply-To: References: Message-ID: <8GoprkMChP1bYoO8PPNijDfhYgMtrD4UGuaHHTiRbZs=.12a42da9-b05f-497e-9c7e-616907aebe90@github.com> On Thu, 9 Mar 2023 05:31:52 GMT, Vladimir Kozlov wrote: >> Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. >> We have *product* VM flags for most intrinsics and set them in VM based on HW support. >> But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. >> Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. >> >> I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). >> >> Additional fixes: >> Fixed Interpreter to skip intrinsics if they are disabled with flag. >> Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. >> Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. >> Added missing `native` mark to `_currentThread`. >> Removed unused `AbstractInterpreter::in_native_entry()`. >> Cleanup C2 intrinsic checks code. >> >> Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address comments So are there plans to migrate to this new mechanism and remove those global flags? ------------- PR: https://git.openjdk.org/jdk/pull/12858 From fjiang at openjdk.org Thu Mar 9 06:43:25 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 9 Mar 2023 06:43:25 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v2] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 20:54:17 GMT, Jorn Vernee wrote: > @feilongjiang Could you comment on this? If you could figure out the needed sizes for RISCV I could add the needed changes to this patch. Otherwise I could file a followup issue if that seems more convenient. 
TIA Yes, I will take a look to find out the needed size for RISCV. ------------- PR: https://git.openjdk.org/jdk/pull/12908 From kvn at openjdk.org Thu Mar 9 06:43:25 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 06:43:25 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v2] In-Reply-To: <8GoprkMChP1bYoO8PPNijDfhYgMtrD4UGuaHHTiRbZs=.12a42da9-b05f-497e-9c7e-616907aebe90@github.com> References: <8GoprkMChP1bYoO8PPNijDfhYgMtrD4UGuaHHTiRbZs=.12a42da9-b05f-497e-9c7e-616907aebe90@github.com> Message-ID: On Thu, 9 Mar 2023 06:22:56 GMT, David Holmes wrote: > So are there plans to migrate to this new mechanism and remove those global flags? Good point. I filed RFE [JDK-8303864](https://bugs.openjdk.org/browse/JDK-8303864) ------------- PR: https://git.openjdk.org/jdk/pull/12858 From thartmann at openjdk.org Thu Mar 9 07:27:28 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 9 Mar 2023 07:27:28 GMT Subject: RFR: 8201516: DebugNonSafepoints generates incorrect information [v4] In-Reply-To: References: <0gI6DIHtc7F63CFYoccotGQv-BHYadPRW0liqEQvh6Q=.58a774ec-7591-4f44-aa7f-7755593ac04e@github.com> Message-ID: On Wed, 8 Mar 2023 19:24:30 GMT, Xin Liu wrote: > Is it a particular reason that Compile needs at least 8 blocks? It's the value that is used when creating the array: https://github.com/openjdk/jdk/blob/4619e8bae838abd1f243c2c65a538806d226b8e8/src/hotspot/share/opto/compile.cpp#L1060-L1062 I don't think there is a particular reason for 8 but it's just one of those more or less reasonable default values/sizes that we use all over the place when creating containers. ------------- PR: https://git.openjdk.org/jdk/pull/12806 From epeter at openjdk.org Thu Mar 9 07:52:18 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 9 Mar 2023 07:52:18 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v22] In-Reply-To: References: Message-ID: <3B2STiUAqHX5dlu7J3qXXEp9gagFhw7k4jy-TFCIcm8=.827d38c0-919a-4550-827a-bde58006ecd5@github.com> On Wed, 8 Mar 2023 20:39:43 GMT, Vladimir Kozlov wrote: > May be because I don't know what data `DepPreds` is operating on. https://github.com/openjdk/jdk/blob/a44082b61f22dcdee115697f34d39c1d8382a15d/src/hotspot/share/opto/superword.cpp#L4913-L4934 This comes from https://github.com/openjdk/jdk/blob/a44082b61f22dcdee115697f34d39c1d8382a15d/src/hotspot/share/opto/superword.cpp#L1094-L1097 Basically, `DepPreds` is an iterator of the `memory` (load / store) dependencies and the `data` dependencies. The `memory` edges are detected during `dependence_graph`: For every slice, look at all pairs of memops (exclude Load-Load pairs, those are never true dependnecies). If we have `!SWPointer::not_equal` for the two memops, we do not know that they are "not-equal", and we add a dependence edge. If we know that `SWPointer::not_equal` we do not need a dependence edge, because the memory does not overlap (eg. `data[i]` and `data[i+1]`). In the constructor of `DepPreds`, you can see what dependencies are included for each type of node: - Load / Store TODO continue ------------- PR: https://git.openjdk.org/jdk/pull/12350 From thartmann at openjdk.org Thu Mar 9 07:52:22 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 9 Mar 2023 07:52:22 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v2] In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 03:47:54 GMT, Jasmine K. 
wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only in the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ± 1.770 ns/op / 54.199 ± 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ± 15.334 ns/op / 538.408 ± 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ± 29.010 ns/op / 701.356 ± 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ± 14.226 ns/op / 533.588 ± 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ± 32.856 ns/op / 649.871 ± 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ± 14.721 ns/op / 689.270 ± 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ± 23.573 ns/op / 810.185 ± 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, as they are for `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Comments from code review The change looks good to me and internal testing passes. Thanks for the contribution! > This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts Are we sure that this is profitable on all architectures? Another review would be good. test/hotspot/jtreg/compiler/c2/irTests/LShiftINodeIdealizationTests.java line 122: > 120: @IR(failOn = { IRNode.RSHIFT }) > 121: @IR(counts = { IRNode.AND, "1", IRNode.LSHIFT, "1" }) > 122: // Checks (x >> 4) << 8 => (x << 4) & 0xFF00 Suggestion: // Checks ((x >> 4) & 0xFF) << 8 => (x << 4) & 0xFF00 Right? Same for test8 and the corresponding long tests. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12734 From wanghaomin at openjdk.org Thu Mar 9 07:56:14 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Thu, 9 Mar 2023 07:56:14 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> Message-ID: On Thu, 9 Mar 2023 01:19:40 GMT, Wang Haomin wrote: >> After https://bugs.openjdk.org/browse/JDK-8292289, the base class of VectorTestNode changed from Node to CmpNode. So I added two match rules to the ad file.
>> >> match(If cop (VectorTest op1 op2)); >> match(Set dst (CMoveI (Binary cop (VectorTest op1 op2)) (Binary src1 src2))); >> First error, rule1 shouldn't generate the statement "node->_bottom_type = _leaf->bottom_type();". >> Second error, both rule1 and rule2 need to use VectorTestNode, so the VectorTestNode should be cloned like a CmpNode. > Wang Haomin has updated the pull request incrementally with one additional commit since the last revision: > compare the results with 0 @RealFYang Could you review it? I think there are the same errors in riscv. ------------- PR: https://git.openjdk.org/jdk/pull/12917 From roland at openjdk.org Thu Mar 9 08:03:26 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 9 Mar 2023 08:03:26 GMT Subject: RFR: 8303564: C2: "Bad graph detected in build_loop_late" after a CMove is wrongly split thru phi In-Reply-To: References: Message-ID: <8Wa0vVsCZyjrxiTd62XRVmq1WVL01I2u7ylZYUGxOp4=.15374fc7-919d-48cc-9598-8eec7f55b9d6@github.com> On Fri, 3 Mar 2023 16:25:39 GMT, Vladimir Kozlov wrote: >> The following steps lead to the crash: >> >> - In `testHelper()`, the null and range checks for the `field1[0]` >> load are hoisted out of the counted loop by loop predication
> > - the `stop` variable that captures the limit of the loop is > transformed into a `Phi` that merges 1 and 2. > > - As a result, the Phi that's stored in `field3` now only merges the > value of the pre and post loop and is transformed into a `CmoveI` > that merges 2 values dependent on the `field1[0]` `LoadI` that's > control dependent on a predicate. > > - On the next round of loop opts, the `CmoveI` is assigned control > below the predicate but the `Bool`/`CmpI` for the `CmoveI` is > assigned control above, right below a `Region` that has a `Phi` that > is input to the `CmpI`. The reason is this logic: > https://github.com/rwestrel/jdk/blob/99f5687eb192b249a4a4533578f56b131fb8f234/src/hotspot/share/opto/loopnode.cpp#L5968 > > - The `CmoveI` is split thru phi because the `Bool`/`CmpI` have > control right below a `Region`. That shouldn't happen because the > `CmoveI` itself doesn't have control at the `Region` and is actually > pinned through the `LoadI` below the `Region`. > > The fix I propose is to check the control of the `CmoveI` before > proceding with split thru phi. This pull request has now been integrated. Changeset: 5e232cf0 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/5e232cf0a96cf81036a2d9d7814127b7bc9ebab1 Stats: 82 lines in 2 files changed: 80 ins; 0 del; 2 mod 8303564: C2: "Bad graph detected in build_loop_late" after a CMove is wrongly split thru phi Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12851 From roland at openjdk.org Thu Mar 9 08:04:31 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 9 Mar 2023 08:04:31 GMT Subject: Integrated: 8300258: C2: vectorization fails on simple ByteBuffer loop In-Reply-To: References: Message-ID: On Mon, 6 Feb 2023 14:15:19 GMT, Roland Westrelin wrote: > The loop that doesn't vectorize is: > > > public static void testByteLong4(byte[] dest, long[] src, int start, int stop) { > for (int i = start; i < stop; i++) { > UNSAFE.putLongUnaligned(dest, 8 * i + baseOffset, src[i]); > } > } > > > It's from a micro-benchmark in the panama > repo. `SuperWord::find_adjacent_refs() `prevents it from vectorizing > because it finds it cannot properly align the loop and, from the > comment in the code, that: > > > // Can't allow vectorization of unaligned memory accesses with the > // same type since it could be overlapped accesses to the same array. > > > The test for "same type" is implemented by looking at the memory > operation type which in this case is overly conservative as the loop > above is reading and writing with long loads/stores but from and to > arrays of different types that can't overlap. Actually, with such > mismatched accesses, it's also likely an incorrect test (reading and > writing could be to the same array with loads/stores that use > different operand size) eventhough I couldn't write a test case that > would trigger an incorrect execution. > > As a fix, I propose implementing the "same type" test by looking at > memory aliases instead. This pull request has now been integrated. 
Changeset: dc523a58 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/dc523a58a6ece87e5865bea0342415a969172c77 Stats: 426 lines in 4 files changed: 415 ins; 1 del; 10 mod 8300258: C2: vectorization fails on simple ByteBuffer loop Co-authored-by: Emanuel Peter Reviewed-by: epeter, kvn ------------- PR: https://git.openjdk.org/jdk/pull/12440 From epeter at openjdk.org Thu Mar 9 09:50:33 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 9 Mar 2023 09:50:33 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v22] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 20:39:43 GMT, Vladimir Kozlov wrote: > The one thing I don't understand is new method `find_dependence()`. @vnkozlov Now to your questions about `find_dependence()`. I need a method that can check if a `pack` is `independent`, i.e. that all members are mutually `independent`. I want to filter with it, and verify at the end. All it checks is whether any of the `pack`-members have a `DepPreds` path to any other `pack`-member. For that, I could also check `independent(s1, s2)` on every pair, but that would be quadratically many BFS traversals, one per `independent` call. Note that the depth range is used inside `independence`: The idea is to start at the "deeper" node, and BFS up the DAG, but no further than the depth of the "shallower" node. Because if we go further up, we will never find the "shallower" node. The BFS traversal is a bit convoluted, and implemented with recursive function calls to `independent_path` and checking `visited_test` to ensure nodes are not visited more than once. I now generalized this `independence` query to the `pack`-level: - Find `min_d`, the depth of the "shallowest" node in the `pack` (no need to ever traverse further up than that depth). - Instead of BFS-ing from one start node, directly start at all nodes from the `pack`. - I mark all nodes from the `pack` as `visited`, and only those. This gives me an easy query to check if a node is from the `pack`. - The `worklist` (`Unique_Node_List`) is my BFS data-structure. It is unique so that we only visit every node once. And it simultaneously works as the BFS queue, as `j` iterates over them linearly. - I BFS traverse upward, only inside the basic block (`in_bb`), and only up to `min_d`. If I ever find a node with `visit_test(pred)`, then I know that I have encountered a node from the `pack` from below. This implies that there must be a path from one of the nodes in the `pack` up to this node (return it). There is thus a dependence between two `si, sj` in the `pack`. If I never find a path, then I know that there cannot be a dependence, the `pack` is `independent`. I hope I have clarified things. Except maybe this point: > Does not matter if they are in an other pack or not? `find_dependence` only checks if all the members of a `pack` are mutually `independent` in the DAG. One might now wonder: do other `packs` not have an impact also? Might another pack not create additional dependencies "across" the vector elements? In the [paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf), they have this example: Quote: 3.7 Scheduling Dependence analysis before packing ensures that statements within a group can be executed safely in parallel. However, it may be the case that executing two groups produces a dependence violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between groups if a statement in one group is dependent on a statement in the other.
As long as there are no cycles in this dependence graph, all groups can be scheduled such that no violations occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group will need to be eliminated. Although experimental data has shown this case to be extremely rare, care must be taken to ensure correctness. The idea is this: before `schedule`, we must ensure that `packs` are `independent`. **But:** `independence` on the `pack` level is **not** sufficient, we also need to ensure that the `packs` (groups) are acyclic before we `schedule`. From the comments in `schedule -> co_locate_pack`, it seems we are assuming that there are no cycles (see point 5 in the list there). So is there a **4th Bug** lurking here? Maybe. If so, I'd say we should fix that in a separate RFE, since all my examples failed because of `dependence` in the `pack`, and not because of cyclic dependencies between `independent` packs. https://github.com/openjdk/jdk/blob/a44082b61f22dcdee115697f34d39c1d8382a15d/src/hotspot/share/opto/superword.cpp#L1367-L1376 https://github.com/openjdk/jdk/blob/a44082b61f22dcdee115697f34d39c1d8382a15d/src/hotspot/share/opto/superword.cpp#L2400-L2417 ------------- PR: https://git.openjdk.org/jdk/pull/12350 From jbhateja at openjdk.org Thu Mar 9 10:18:09 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 9 Mar 2023 10:18:09 GMT Subject: RFR: 8303105: LoopRangeStrideTest fails IR verification on x86 Message-ID: SLP fails to recognize valid address expression during SWPointer creation for memory operands with 32 bit jvm; this prevents gathering adjacent memory operations. Debug trace with -XX:+TraceSuperWord -XX:+TraceNewVectors -XX:CompileCommand=VectorizeDebug,,3 shows the following errors:
> > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > > Problem also exist in JDK17 LTS. As an interim solution to prevent this showing up as a GHA test failure, we can enable the test only for x86_64 and aarch64 targets. Difference in address expression b/w X86 32 and 64 bit jvm will be root caused in follow issue JDK-8303885. Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12938 From epeter at openjdk.org Thu Mar 9 10:29:45 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 9 Mar 2023 10:29:45 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v23] In-Reply-To: References: Message-ID: > **List of important things below** > > - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 > - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 > - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord > - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 > - Explanation of my new `find_dependency`. Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 > > **Original RFE description:** > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. 
In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 37 commits: - resolve merge conflict after Roland's fix - TestDependencyOffsets.java: add vanilla run - TestDependencyOffsets.java: parallelize it + various AVX settings - TestOptionVectorizeIR.java: removed PopulateIndex IR rule - fails on x86 32bit - see Matcher::match_rule_supported - Merge branch 'master' into JDK-8298935 - Reworked TestDependencyOffsets.java - remove negative IR rules for TestOptionVectorizeIR.java - removed negative rules for TestCyclicDependency.java - TestDependencyOffsets.java: MulVL not supported on NEON / asimd. Replaced it with AddVL - Fix TestOptionVectorizeIR.java for aarch64 machines with AlignVector == true - ... and 27 more: https://git.openjdk.org/jdk/compare/34a92466...0f7e39c4 ------------- Changes: https://git.openjdk.org/jdk/pull/12350/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=22 Stats: 12910 lines in 7 files changed: 12852 ins; 44 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From roland at openjdk.org Thu Mar 9 10:55:21 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 9 Mar 2023 10:55:21 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops Message-ID: In the test case `testByteLong1` (that's extracted from a memory segment micro benchmark), the address of the store is initially: (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) (#numbers are node numbers to help the discussion). `iv#101` is the `Phi` of a counted loop. `invar#163` is the `baseOffset` load. To eliminate the range check, the loop is transformed into a loop nest and as a consequence the address above becomes: (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) `invar#308` is some expression from a `Phi` of the outer loop. That `AddP` is transformed multiple times to push the invariants out of loop: (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) then: (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) and finally: (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) `AddP#855` is out of the inner loop. This doesn't vectorize because: - there are 2 invariants in the address expression but superword only support one (tracked by `_invar` in `SWPointer`) - there are more levels of `AddP` (4) than superword supports (3) To fix that, I propose to no longer track the address elements in `_invar`, `_negate_invar` and `_invar_scale` but instead to have a single `_invar` which is an expression built by superword as it follows chains of `addP` nodes. I kept the previous `_invar`, `_negate_invar` and `_invar_scale` as debugging and use them to check that what vectorized with the previous scheme still does. I also propose lifting the restriction on 3 levels of `AddP` entirely. 
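To make the proposed scheme a little more concrete, here is a minimal, self-contained sketch of the idea. It uses a toy `Expr` type and hypothetical helper names rather than the actual `SWPointer` code (see the patch for the real implementation): accumulate every loop-invariant offset found while following the chain of `AddP`-like nodes, instead of tracking a single invariant node plus a sign and scale.

```c++
#include <vector>

// Toy stand-in for a C2 address node; only what the sketch needs.
struct Expr {
  bool  is_add_p       = false;   // models an AddP node
  bool  loop_invariant = false;   // true if independent of the inner-loop iv
  Expr* address        = nullptr; // AddP: next AddP (or the base) in the chain
  Expr* offset         = nullptr; // AddP: offset input
};

// Follow the AddP chain and collect every loop-invariant offset,
// conceptually building one combined invariant expression.
std::vector<Expr*> collect_invariant_terms(Expr* addr) {
  std::vector<Expr*> invar_terms;
  while (addr != nullptr && addr->is_add_p) {
    Expr* off = addr->offset;
    if (off != nullptr && off->loop_invariant) {
      invar_terms.push_back(off);   // fold into the combined invariant
    }
    // A non-invariant offset would be matched against "scale * iv + offset" here.
    addr = addr->address;           // keep walking, with no fixed limit on the chain depth
  }
  return invar_terms;
}
```

In the final address expression above, both `invar#163` and `invar#577` would then contribute to the single combined invariant, and the number of `AddP` levels no longer matters.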
------------- Commit messages: - fix & test Changes: https://git.openjdk.org/jdk/pull/12942/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12942&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8300257 Stats: 274 lines in 3 files changed: 212 ins; 23 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/12942.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12942/head:pull/12942 PR: https://git.openjdk.org/jdk/pull/12942 From jbhateja at openjdk.org Thu Mar 9 12:09:24 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 9 Mar 2023 12:09:24 GMT Subject: RFR: 8303105: LoopRangeStrideTest fails IR verification on x86 In-Reply-To: <50tGU73Ql39yqO_ouY9O1EhnBh8bzzUcGxica9u6vvw=.ee2295f9-38f5-44bf-a014-12c2bd310889@github.com> References: <50tGU73Ql39yqO_ouY9O1EhnBh8bzzUcGxica9u6vvw=.ee2295f9-38f5-44bf-a014-12c2bd310889@github.com> Message-ID: On Thu, 9 Mar 2023 10:25:12 GMT, Tobias Hartmann wrote: >> SLP fails to recognize valid address expression during SWPointer creation for memory operands with 32 bit jvm, this prevents gathering adjacent memory operations. >> Debug trace with -XX:+TraceSuperWord -XX:+TraceNewVectors -XX:CompileCommand=VectorizeDebug,,3 shows following errors . >> >> SWPointer::memory_alignment: SWPointer p invalid, return bottom_align >> SWPointer::memory_alignment: SWPointer p invalid, return bottom_align >> SWPointer::memory_alignment: SWPointer p invalid, return bottom_align >> SWPointer::memory_alignment: SWPointer p invalid, return bottom_align >> SWPointer::memory_alignment: SWPointer p invalid, return bottom_align >> >> Problem also exist in JDK17 LTS. As an interim solution to prevent this showing up as a GHA test failure, we can enable the test only for x86_64 and aarch64 targets. Difference in address expression b/w X86 32 and 64 bit jvm will be root caused in follow issue JDK-8303885. > > Looks good and trivial. Thanks @TobiHartmann , integrating it. ------------- PR: https://git.openjdk.org/jdk/pull/12938 From jbhateja at openjdk.org Thu Mar 9 12:09:26 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 9 Mar 2023 12:09:26 GMT Subject: Integrated: 8303105: LoopRangeStrideTest fails IR verification on x86 In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 10:08:51 GMT, Jatin Bhateja wrote: > SLP fails to recognize valid address expression during SWPointer creation for memory operands with 32 bit jvm, this prevents gathering adjacent memory operations. > Debug trace with -XX:+TraceSuperWord -XX:+TraceNewVectors -XX:CompileCommand=VectorizeDebug,,3 shows following errors . > > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > SWPointer::memory_alignment: SWPointer p invalid, return bottom_align > > Problem also exist in JDK17 LTS. As an interim solution to prevent this showing up as a GHA test failure, we can enable the test only for x86_64 and aarch64 targets. Difference in address expression b/w X86 32 and 64 bit jvm will be root caused in follow issue JDK-8303885. This pull request has now been integrated. 
Changeset: 713def0b Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/713def0bf25c3488afb72e453f3b7cd09a909599 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8303105: LoopRangeStrideTest fails IR verification on x86 Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12938 From jsjolen at openjdk.org Thu Mar 9 12:15:33 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Thu, 9 Mar 2023 12:15:33 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v7] In-Reply-To: References: Message-ID: > Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we > need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. > > Here are some typical things to look out for: > > No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). > Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. > nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. > > An example of this: > > > // This function returns null > void* ret_null(); > // This function returns true if *x == nullptr > bool is_nullptr(void** x); > > > Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. > > Thanks! Johan Sj?len has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: - Merge remote-tracking branch 'origin/master' into JDK-8301074 - Jesper's fixes - Merge remote-tracking branch 'origin/JDK-8301074' into JDK-8301074 - Explicitly use 0 for null in ARM interpreter - Merge remote-tracking branch 'origin/master' into JDK-8301074 - Remove trailing whitespace - Check for null string explicitly - vkozlov fixes - Manual review fixes - Fix - ... and 2 more: https://git.openjdk.org/jdk/compare/34a92466...8efdb67a ------------- Changes: https://git.openjdk.org/jdk/pull/12187/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12187&range=06 Stats: 5594 lines in 111 files changed: 1 ins; 0 del; 5593 mod Patch: https://git.openjdk.org/jdk/pull/12187.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12187/head:pull/12187 PR: https://git.openjdk.org/jdk/pull/12187 From epeter at openjdk.org Thu Mar 9 13:34:06 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 9 Mar 2023 13:34:06 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v24] In-Reply-To: References: Message-ID: > **List of important things below** > > - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 > - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 > - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord > - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 > - Explanation of my new `find_dependency`. 
Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 > > **Original RFE description:** > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. 
> > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: A little renaming and improved comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/0f7e39c4..216bb1a0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=23 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=22-23 Stats: 59 lines in 2 files changed: 22 ins; 8 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Thu Mar 9 13:45:11 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 9 Mar 2023 13:45:11 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v25] In-Reply-To: References: Message-ID: > **List of important things below** > > - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 > - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 > - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord > - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 > - Explanation of my new `find_dependency`. Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 > > **Original RFE description:** > Cyclic dependencies are not handled correctly in all cases. 
Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. 
If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Fixed wording from last commit ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12350/files - new: https://git.openjdk.org/jdk/pull/12350/files/216bb1a0..ef61acd5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=24 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=23-24 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From jsjolen at openjdk.org Thu Mar 9 14:17:14 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Thu, 9 Mar 2023 14:17:14 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v8] In-Reply-To: References: Message-ID: > Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we > need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. > > Here are some typical things to look out for: > > No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). > Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. > nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. > > An example of this: > > > // This function returns null > void* ret_null(); > // This function returns true if *x == nullptr > bool is_nullptr(void** x); > > > Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. > > Thanks! Johan Sj?len has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 13 commits: - Merge remote-tracking branch 'origin/master' into JDK-8301074 - Merge remote-tracking branch 'origin/master' into JDK-8301074 - Jesper's fixes - Merge remote-tracking branch 'origin/JDK-8301074' into JDK-8301074 - Explicitly use 0 for null in ARM interpreter - Merge remote-tracking branch 'origin/master' into JDK-8301074 - Remove trailing whitespace - Check for null string explicitly - vkozlov fixes - Manual review fixes - ... and 3 more: https://git.openjdk.org/jdk/compare/1e9942aa...038599b9 ------------- Changes: https://git.openjdk.org/jdk/pull/12187/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12187&range=07 Stats: 5594 lines in 111 files changed: 1 ins; 0 del; 5593 mod Patch: https://git.openjdk.org/jdk/pull/12187.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12187/head:pull/12187 PR: https://git.openjdk.org/jdk/pull/12187 From kvn at openjdk.org Thu Mar 9 16:31:29 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 16:31:29 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 10:48:06 GMT, Roland Westrelin wrote: > In the test case `testByteLong1` (that's extracted from a memory > segment micro benchmark), the address of the store is initially: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) > > > (#numbers are node numbers to help the discussion). > > `iv#101` is the `Phi` of a counted loop. `invar#163` is the > `baseOffset` load. > > To eliminate the range check, the loop is transformed into a loop nest > and as a consequence the address above becomes: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) > > > `invar#308` is some expression from a `Phi` of the outer loop. > > That `AddP` is transformed multiple times to push the invariants out of loop: > > > (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) > > > then: > > > (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) > > > and finally: > > > (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) > > > `AddP#855` is out of the inner loop. > > This doesn't vectorize because: > > - there are 2 invariants in the address expression but superword only > support one (tracked by `_invar` in `SWPointer`) > > - there are more levels of `AddP` (4) than superword supports (3) > > To fix that, I propose to no longer track the address elements in > `_invar`, `_negate_invar` and `_invar_scale` but instead to have a > single `_invar` which is an expression built by superword as it > follows chains of `addP` nodes. I kept the previous `_invar`, > `_negate_invar` and `_invar_scale` as debugging and use them to check > that what vectorized with the previous scheme still does. > > I also propose lifting the restriction on 3 levels of `AddP` entirely. So the final expression should be seen by superword as next since `AddP#855` is invariant for inner loop: (AddP#568 base#195 (AddP#949 base#195 (invar#855) (ConvI2L#938 (LShiftI#896 iv#908)))) First, I think you messed up with `()`. AddP node should have 3 inputs: base, address and offset. 
I don't see offset for `AddP#568`. Second, if superword can mark `AddP#855` as invariant (no need to parse it) your address expression become simple. ------------- PR: https://git.openjdk.org/jdk/pull/12942 From kvn at openjdk.org Thu Mar 9 16:43:02 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 16:43:02 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v22] In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 09:47:04 GMT, Emanuel Peter wrote: >> Looks reasonable. The one thing I don't understand is new method `find_dependence()`. >> May be because I don't know what data `DepPreds` is operating on. >> >> Do I understand it correctly?: >> 1. All nodes in one pack are independent >> 2. Using `DepPreds` looks through all inputs for each node in pack and put them on work list if they are in the same block and in the same depth range. Does not matter if they are in an other pack or not? >> 3. Go through these inputs and put their inputs on work list if they satisfy conditions. >> 4. If we find input which is a node in the pack - we got dependence, return this pack's node. >> 5. We check an input only once because we use Unique_Node_List. > >> The one thing I don't understand is new method `find_dependence()`. > > @vnkozlov Now to your questions about `find_dependence()`. > > I need a method that can check if a `pack` is `independent`, ie that all members are mutually `independent`. I want to filter with it, and verify at the end. > > All it checks if any of the `pack`-members have a `DepPreds` path to any other `pack`-member. For that, I could also check `independent(s1, s2)` on every pair, but that would be squared many BFS traversal, one per `independent` call. Note that the depth range is used inside `independence`: The idea is to start at the "deeper" node, and BFS up the DAG, but no further than the depth of the "shallower" node. Because if we go further up, we will never find the "shallower" node. The BFS traversal is a bit convoluted, and implemented with recursive function calls to `independent_path` and checking `visited_test` to ensure nodes are not visited more than once. > > I now generalized this `independence` query to the `pack`-level: > - Find `min_d`, of the "shallowest" node in the `pack` (no need to ever traverse further up than that depth). > - Instead of BFS-ing from one start node, directly start at all nodes from the `pack`. > - I mark all nodes from the `pack` as `visited`, and only those. This gives me an easy query to check if a node is from the `pack`. > - The `worklist` (`Unique_Node_List`) is my BFS data-structure. It is unique so that we only visit every node once. And it simultaneously works as the BFS queue, as `j` iterates over them linearly. > - I BFS traverse upward, only inside the basic block (`in_bb`), and only up to `min_d`. If I ever find a node `visit_test(pred)`, then I know that I have encountered a node from the `pack` from below. This implies that there must be a path from one of the nodes in the `pack` up to this node (return it). There is thus a dependence between two `si, sj` in the `pack`. If I never find a path, then I know that there cannot be a dependence, the `pack` is `independent`. > > I hope I have clarified things. Except maybe this point: >> Does not matter if they are in an other pack or not? > > `find_dependence` only checks if all the members of a `pack` are mutually `independent` in the DAG. 
One might now wonder: do other `packs` not have an impact also? Might another pack not create additional dependencies "across" the vector elements? > > In the [paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf), they have this example: > > > > Quote: > > 3.7 Scheduling > Dependence analysis before packing ensures that statements within a group can be executed > safely in parallel. However, it may be the case that executing two groups produces a dependence > violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between > groups if a statement in one group is dependent on a statement in the other. As long as there > are no cycles in this dependence graph, all groups can be scheduled such that no violations > occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group > will need to be eliminated. Although experimental data has shown this case to be extremely rare, > care must be taken to ensure correctness. > > > The idea is this: before `schedule`, we must ensure that `packs` are `independent`. **But:** `independence` on the `pack` level is **not** sufficient, we also need to ensure that the `packs` (groups) are acyclic before we `schedule`. > > From the comments in `schedule -> co_locate_pack`, it seems we are assuming that there are no cycles (see point 5 in the list there). > > So is there a **4th Bug** lurking here? > > Maybe. If so, I'd say we should fix that in a separate RFE, since all my examples failed because of `dependence` in the `pack`, and not because of cyclic dependencies between `independent` packs. > > If you want, I could implement a verification code, that at least checks that there is no cyclic dependency between the `packs`. I had actually implemented this earlier, but then decided against it, in favour of `find_dependence` - because it is much less code. > https://github.com/openjdk/jdk/blob/5e01c9f5aa39c94bc70069df02e0043678fa69c7/src/hotspot/share/opto/superword.cpp#L2290-L2293 > > But here an **argument why there is no such bug**: In `SuperWord::profitable`, we check that all inputs are also vectors. And in `SuperWord::is_vector_use`, we check that the corresponding "lanes" match. We thus have the same dependencies inside a "lane" of a vector as between the vectors. This is a limitation to our SLP that the paper does not directly require, as far as I see. If we have `n` lanes / elements in a vector, then we basically expect to find `n` `isomorphic` subgraphs, that are completely `independent` from each other. > > In the example from the paper, we see that the vectors do not match (`q,r,s` / `k1,k2,s`), and that values permute between vector lanes (`x,y,z` / `y,k3,k4`). > > **However**: if we were ever to implement `permutations`, or `ExtractNode` or `PackNode`, this would change. We might allow one lane to be unpacked, and re-packed to another. Or a direct permutation does that. > > https://github.com/openjdk/jdk/blob/a44082b61f22dcdee115697f34d39c1d8382a15d/src/hotspot/share/opto/superword.cpp#L1367-L1376 > > https://github.com/openjdk/jdk/blob/a44082b61f22dcdee115697f34d39c1d8382a15d/src/hotspot/share/opto/superword.cpp#L2400-L2417 @eme64 Thank you for explaining code in such details. This confirmed by understanding how `find_dependence()` works. Very good. Please, don't any more investigations in this PR. I think it is solid enough already. File separate RFEs and look on them later. 
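For readers following along, here is a minimal, self-contained sketch of the single-BFS check described above. It uses a toy `Node` struct and standard containers instead of C2's `Node`, `DepPreds` and `Unique_Node_List`; the pruning at `min_d` follows the explanation, everything else is an illustrative assumption rather than the actual `find_dependence` code:

```c++
#include <vector>
#include <queue>
#include <unordered_set>
#include <algorithm>

struct Node {
  int depth;                   // topological depth inside the basic block
  std::vector<Node*> preds;    // memory and data predecessors (what DepPreds iterates)
};

// Returns true if no member of the pack can reach another member via
// predecessor edges, i.e. the pack members are mutually independent.
bool pack_is_independent(const std::vector<Node*>& pack) {
  if (pack.size() < 2) return true;
  int min_d = pack[0]->depth;
  for (Node* n : pack) min_d = std::min(min_d, n->depth);     // depth of the shallowest member

  std::unordered_set<Node*> members(pack.begin(), pack.end());
  std::unordered_set<Node*> visited(pack.begin(), pack.end());
  std::queue<Node*> worklist;
  for (Node* n : pack) worklist.push(n);                      // start the BFS at all members at once

  while (!worklist.empty()) {
    Node* n = worklist.front(); worklist.pop();
    for (Node* pred : n->preds) {
      if (pred->depth < min_d) continue;                      // never traverse above the shallowest member
      if (members.count(pred) != 0) return false;             // reached a member from below: dependence
      if (visited.insert(pred).second) worklist.push(pred);   // visit every node only once
    }
  }
  return true;                                                // no path between members
}
```

This is one linear pass over (part of) the block per pack, instead of a quadratic number of pairwise `independent(s1, s2)` queries.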
------------- PR: https://git.openjdk.org/jdk/pull/12350 From kvn at openjdk.org Thu Mar 9 16:52:36 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 16:52:36 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v25] In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 13:45:11 GMT, Emanuel Peter wrote: >> **List of important things below** >> >> - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 >> - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 >> - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord >> - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 >> - Explanation of my new `find_dependency`. Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 >> >> **Original RFE description:** >> Cyclic dependencies are not handled correctly in all cases. Three examples: >> >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 >> >> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 >> >> And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 >> >> All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. >> >> **Analysis** >> >> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: >> >> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. >> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). >> >> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. >> >> I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. 
If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. >> **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! >> >> **Solution** >> >> First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. >> >> I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. >> >> Another change I have made: >> Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. >> >> **Testing** >> >> I added a few more regression tests, and am running tier1-3, plus some stress testing. >> >> However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. >> **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. >> >> **Discussion / Future Work** >> >> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Fixed wording from last commit Looks good to me. There was no big meaning in my question "Does not matter if they are in an other pack or not?" As you explained we go through memory and data inputs. In simple case they would be in an other pack (since we looking only inside block). But in `_do_vector_loop` case (and may be other cases) some packs could be eliminate leaving nodes not in packs. But it does not hinder the search for dependence. That is what I want to say and ask for confirmation. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12350 From roland at openjdk.org Thu Mar 9 16:55:42 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 9 Mar 2023 16:55:42 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops In-Reply-To: References: Message-ID: <3FbBSaF53QKEc4ass3QHuQNlaUWr0bzTz_j3xGnZiEg=.3178b71a-1c56-4b50-b208-041ae693d91d@github.com> On Thu, 9 Mar 2023 16:28:36 GMT, Vladimir Kozlov wrote: > So the final expression should be seen by superword as next since `AddP#855` is invariant for inner loop: > > ``` > (AddP#568 base#195 (AddP#949 base#195 (invar#855) (ConvI2L#938 (LShiftI#896 iv#908)))) > ``` > > First, I think you messed up with `()`. AddP node should have 3 inputs: base, address and offset. I don't see offset for `AddP#568`. Second, if superword can mark `AddP#855` as invariant (no need to parse it) your address expression become simple. I left the constant part of the address out of the expressions. So that was on purpose but confusing, sorry. The current logic only sets the invariant to the result of an integer operation and otherwise follows through the AddP and checks that all of them have the same base. So If we wanted to use the `AddP` as invariant, we would need to change the logic so it stops as the first `AddP` that's loop invariant, we would likely not check that all `AddP` nodes have the same base and we would need to cast the `AddP` to an integer to compute the loop alignment. Do you think that's better that what I'm proposing? I don't really have a strong opinion. Your suggestion is likely a smaller change. ------------- PR: https://git.openjdk.org/jdk/pull/12942 From kvn at openjdk.org Thu Mar 9 17:19:48 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 17:19:48 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 10:48:06 GMT, Roland Westrelin wrote: > In the test case `testByteLong1` (that's extracted from a memory > segment micro benchmark), the address of the store is initially: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) > > > (#numbers are node numbers to help the discussion). > > `iv#101` is the `Phi` of a counted loop. `invar#163` is the > `baseOffset` load. > > To eliminate the range check, the loop is transformed into a loop nest > and as a consequence the address above becomes: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) > > > `invar#308` is some expression from a `Phi` of the outer loop. > > That `AddP` is transformed multiple times to push the invariants out of loop: > > > (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) > > > then: > > > (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) > > > and finally: > > > (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) > > > `AddP#855` is out of the inner loop. 
> > This doesn't vectorize because: > > - there are 2 invariants in the address expression but superword only > support one (tracked by `_invar` in `SWPointer`) > > - there are more levels of `AddP` (4) than superword supports (3) > > To fix that, I propose to no longer track the address elements in > `_invar`, `_negate_invar` and `_invar_scale` but instead to have a > single `_invar` which is an expression built by superword as it > follows chains of `addP` nodes. I kept the previous `_invar`, > `_negate_invar` and `_invar_scale` as debugging and use them to check > that what vectorized with the previous scheme still does. > > I also propose lifting the restriction on 3 levels of `AddP` entirely. So we need to check all AddP to find base offset. No short cuts then :( I agree with your proposal with invariants. Last time I touched this code it was one of pain points that we have only one _invar. Any improvements to that is welcome. I will start testing while looking on changes. ------------- PR: https://git.openjdk.org/jdk/pull/12942 From rcastanedalo at openjdk.org Thu Mar 9 18:41:32 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 9 Mar 2023 18:41:32 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter Message-ID: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and - "Condense graph", which makes the graph more compact without loss of information. Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: - combining Bool and conversion nodes into their predecessors, - inlining all Parm nodes except control into their successors (this removes lots of long edges), - removing "top" inputs from call-like nodes, - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. 
This helps reducing node width, which is known to affect layout quality negatively: ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) Note that the exact input indices can still be retrieved via the incoming edge's tooltips: ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) The control-flow graph view is also adapted to this representation: ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) #### Additional improvements Additionally, this changeset: - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and - defines and documents JavaScript helpers to simplify the new and existing available filters. Here is an example of the effect of the new "Show custom node info" filter: ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) ### Testing #### Functionality - Tested the functionality manually on a small selection of graphs. - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). #### Performance Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). ------------- Commit messages: - Document filter helpers - Combine Bool nodes into SubTypeCheck comparisons as well - Generalize null-pointer slot text filter - Fix figure selection - Split simplify graph filter into two, ensure they are applied in right order - Make slots searchable and selectable - Make dots bolder - Remove broken attempt to add split nodes as output slots - Restore default settings - Undo pretty-printing of CatchProj nodes from HotSpot side - ... 
and 34 more: https://git.openjdk.org/jdk/compare/56512cfe...b379e87c Changes: https://git.openjdk.org/jdk/pull/12955/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12955&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8302738 Stats: 935 lines in 39 files changed: 574 ins; 213 del; 148 mod Patch: https://git.openjdk.org/jdk/pull/12955.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12955/head:pull/12955 PR: https://git.openjdk.org/jdk/pull/12955 From jsjolen at openjdk.org Thu Mar 9 20:31:44 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Thu, 9 Mar 2023 20:31:44 GMT Subject: RFR: JDK-8301074: Replace NULL with nullptr in share/opto/ [v8] In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 14:17:14 GMT, Johan Sj?len wrote: >> Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we >> need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. >> >> Here are some typical things to look out for: >> >> No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). >> Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. >> nullptr in comments and logs. We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. >> >> An example of this: >> >> >> // This function returns null >> void* ret_null(); >> // This function returns true if *x == nullptr >> bool is_nullptr(void** x); >> >> >> Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. >> >> Thanks! > > Johan Sj?len has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8301074 > - Merge remote-tracking branch 'origin/master' into JDK-8301074 > - Jesper's fixes > - Merge remote-tracking branch 'origin/JDK-8301074' into JDK-8301074 > - Explicitly use 0 for null in ARM interpreter > - Merge remote-tracking branch 'origin/master' into JDK-8301074 > - Remove trailing whitespace > - Check for null string explicitly > - vkozlov fixes > - Manual review fixes > - ... and 3 more: https://git.openjdk.org/jdk/compare/1e9942aa...038599b9 Passes tier1. Integrating, thank you for the reviews. ------------- PR: https://git.openjdk.org/jdk/pull/12187 From jsjolen at openjdk.org Thu Mar 9 20:31:45 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Thu, 9 Mar 2023 20:31:45 GMT Subject: Integrated: JDK-8301074: Replace NULL with nullptr in share/opto/ In-Reply-To: References: Message-ID: On Wed, 25 Jan 2023 11:46:36 GMT, Johan Sj?len wrote: > Hi, this PR changes all occurrences of NULL to nullptr for the subdirectory share/opto/. Unfortunately the script that does the change isn't perfect, and so we > need to comb through these manually to make sure nothing has gone wrong. I also review these changes but things slip past my eyes sometimes. > > Here are some typical things to look out for: > > No changes but copyright header changed (probably because I reverted some changes but forgot the copyright). > Macros having their NULL changed to nullptr, these are added to the script when I find them. They should be NULL. > nullptr in comments and logs. 
We try to use lower case "null" in these cases as it reads better. An exception is made when code expressions are in a comment. > > An example of this: > > > // This function returns null > void* ret_null(); > // This function returns true if *x == nullptr > bool is_nullptr(void** x); > > > Note how nullptr participates in a code expression here, we really are talking about the specific value nullptr. > > Thanks! This pull request has now been integrated. Changeset: 5726d31e Author: Johan Sj?len URL: https://git.openjdk.org/jdk/commit/5726d31e56530bbe7dee61ae04b126e20cb3611d Stats: 5594 lines in 111 files changed: 1 ins; 0 del; 5593 mod 8301074: Replace NULL with nullptr in share/opto/ Reviewed-by: kvn, jwilhelm ------------- PR: https://git.openjdk.org/jdk/pull/12187 From epeter at openjdk.org Thu Mar 9 20:57:59 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 9 Mar 2023 20:57:59 GMT Subject: RFR: 8303611: Null pointer dereference in block.hpp:586 (ID: 44856) Message-ID: Replaced `unique_ctrl_out_or_null` with `unique_ctrl_out`, which asserts if it finds `nullptr`. This is better than running into a `nullptr`-dereference inside `get_block_for_node`. This was found by a static code analyzer, so it is not clear that a `nullptr` dereference would ever happen. But let's still fix it. ------------- Commit messages: - Merge branch 'master' into JDK-8303611 - 8303611: Null pointer dereference in block.hpp:586 (ID: 44856) Changes: https://git.openjdk.org/jdk/pull/12919/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12919&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303611 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12919.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12919/head:pull/12919 PR: https://git.openjdk.org/jdk/pull/12919 From epeter at openjdk.org Thu Mar 9 21:04:57 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 9 Mar 2023 21:04:57 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v26] In-Reply-To: References: Message-ID: <_ygbSM3ARi21Y5jfOqEIYEqZe84cwTbQ6vayG9hVYiY=.36ace32c-73ac-452e-a9ac-f9ffaea08931@github.com> > **List of important things below** > > - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 > - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 > - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord > - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 > - Explanation of my new `find_dependency`. Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 > > **Original RFE description:** > Cyclic dependencies are not handled correctly in all cases. 
Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. 
If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 40 commits: - Merge master after NULL -> nullptr conversion - Fixed wording from last commit - A little renaming and improved comments - resolve merge conflict after Roland's fix - TestDependencyOffsets.java: add vanilla run - TestDependencyOffsets.java: parallelize it + various AVX settings - TestOptionVectorizeIR.java: removed PopulateIndex IR rule - fails on x86 32bit - see Matcher::match_rule_supported - Merge branch 'master' into JDK-8298935 - Reworked TestDependencyOffsets.java - remove negative IR rules for TestOptionVectorizeIR.java - ... and 30 more: https://git.openjdk.org/jdk/compare/5726d31e...731cc7b5 ------------- Changes: https://git.openjdk.org/jdk/pull/12350/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=25 Stats: 12929 lines in 7 files changed: 12868 ins; 46 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From kvn at openjdk.org Thu Mar 9 21:11:54 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 21:11:54 GMT Subject: RFR: 8303611: Null pointer dereference in block.hpp:586 (ID: 44856) In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 08:51:46 GMT, Emanuel Peter wrote: > Replaced `unique_ctrl_out_or_null` with `unique_ctrl_out`, which asserts if it finds `nullptr`. This is better than running into a `nullptr`-dereference inside `get_block_for_node`. > > This was found by a static code analyzer, so it is not clear that a `nullptr` dereference would ever happen. But let's still fix it. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12919 From duke at openjdk.org Thu Mar 9 22:47:52 2023 From: duke at openjdk.org (Jasmine K.) Date: Thu, 9 Mar 2023 22:47:52 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v3] In-Reply-To: References: Message-ID: <8j6FJE2RODz6nR8hzV3K7WboYWg8mR2q2zKPlU8Bts8=.8578d67b-ef6b-4962-9985-c42147e7f4c5@github.com> > Hello, > I would like to generalize two ideal transforms for bitwise shifts. 
Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% > LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% > LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% > LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% > LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% > LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% > LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% > > > In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. > > Testing: GHA, tier1 local, and performance testing > > Thanks, > Jasmine K Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: Update comments in IR tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12734/files - new: https://git.openjdk.org/jdk/pull/12734/files/bd161561..9a6ff3c4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12734&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12734&range=01-02 Stats: 6 lines in 2 files changed: 1 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/12734.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12734/head:pull/12734 PR: https://git.openjdk.org/jdk/pull/12734 From duke at openjdk.org Thu Mar 9 22:47:55 2023 From: duke at openjdk.org (Jasmine K.) Date: Thu, 9 Mar 2023 22:47:55 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v2] In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 07:47:00 GMT, Tobias Hartmann wrote: >> Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: >> >> Comments from code review > > test/hotspot/jtreg/compiler/c2/irTests/LShiftINodeIdealizationTests.java line 122: > >> 120: @IR(failOn = { IRNode.RSHIFT }) >> 121: @IR(counts = { IRNode.AND, "1", IRNode.LSHIFT, "1" }) >> 122: // Checks (x >> 4) << 8 => (x << 4) & 0xFF00 > > Suggestion: > > // Checks ((x >> 4) & 0xFF) << 8 => (x << 4) & 0xFF00 > > > Right? Same for test8 and the corresponding long tests. Oh, nice catch! It seems I had mistyped there. 
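The corrected identity is easy to sanity-check with a throw-away snippet like the one below. It is purely illustrative and not part of the patch or of its IR test files.

```java
// Checks ((x >> 4) & 0xFF) << 8 == (x << 4) & 0xFF00 on randomly sampled ints.
import java.util.concurrent.ThreadLocalRandom;

public class ShiftAndIdentityCheck {
    public static void main(String[] args) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < 1_000_000; i++) {
            int x = rnd.nextInt();
            int before = ((x >> 4) & 0xFF) << 8;   // shape before the transform
            int after  = (x << 4) & 0xFF00;        // shape after the transform
            if (before != after) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("identity holds on all sampled inputs");
    }
}
```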
------------- PR: https://git.openjdk.org/jdk/pull/12734 From kvn at openjdk.org Thu Mar 9 22:49:02 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 9 Mar 2023 22:49:02 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 10:48:06 GMT, Roland Westrelin wrote: > In the test case `testByteLong1` (that's extracted from a memory > segment micro benchmark), the address of the store is initially: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) > > > (#numbers are node numbers to help the discussion). > > `iv#101` is the `Phi` of a counted loop. `invar#163` is the > `baseOffset` load. > > To eliminate the range check, the loop is transformed into a loop nest > and as a consequence the address above becomes: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) > > > `invar#308` is some expression from a `Phi` of the outer loop. > > That `AddP` is transformed multiple times to push the invariants out of loop: > > > (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) > > > then: > > > (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) > > > and finally: > > > (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) > > > `AddP#855` is out of the inner loop. > > This doesn't vectorize because: > > - there are 2 invariants in the address expression but superword only > support one (tracked by `_invar` in `SWPointer`) > > - there are more levels of `AddP` (4) than superword supports (3) > > To fix that, I propose to no longer track the address elements in > `_invar`, `_negate_invar` and `_invar_scale` but instead to have a > single `_invar` which is an expression built by superword as it > follows chains of `addP` nodes. I kept the previous `_invar`, > `_negate_invar` and `_invar_scale` as debugging and use them to check > that what vectorized with the previous scheme still does. > > I also propose lifting the restriction on 3 levels of `AddP` entirely. My tier1-4,xcomp and stress testing passed. I looked and changes and they seems fine. May be we need to run performance testing too. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12942 From duke at openjdk.org Thu Mar 9 22:54:16 2023 From: duke at openjdk.org (Jasmine K.) Date: Thu, 9 Mar 2023 22:54:16 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v3] In-Reply-To: <8j6FJE2RODz6nR8hzV3K7WboYWg8mR2q2zKPlU8Bts8=.8578d67b-ef6b-4962-9985-c42147e7f4c5@github.com> References: <8j6FJE2RODz6nR8hzV3K7WboYWg8mR2q2zKPlU8Bts8=.8578d67b-ef6b-4962-9985-c42147e7f4c5@github.com> Message-ID: On Thu, 9 Mar 2023 22:47:52 GMT, Jasmine K. wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. 
This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Update comments in IR tests Hi, thanks for the review! I have fixed the mistype in the comments, and have updated the bug headers. As for profitability, I looked through [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) and [uops.info](https://uops.info/table.html) and found that for all x86 microarchitectures listed, bitwise `and` instructions are as good as or faster than bit shift instructions, in terms of how many can be dispatched per cycle. I think even in the cases where the dispatches are the same the transformation can be profitable as the two instructions typically utilize different ports of the processor backend (as seen in uops.info), leading to more thorough utilization of the processor's resources. For different architectures, I wasn't able to readily find resources on instruction latency such as Agner for x86, but I was able to find LLVM's scheduler models for [aarch64](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td), [ppc](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/PowerPC/P9InstrResources.td), and [risc-v](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/RISCV/RISCVSchedSiFive7.td). These all seemed to be similar to x86- where the bitwise `and` instruction is as good as or better than the shift instructions, while also taking up different processor resources. In addition, due to the constant folding opportunities offered by this change I think it should be applicable, but a review from people familiar with different architectures would be helpful. Hope this clarifies! 
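To make the constant-folding point concrete, here is an illustrative shape (made up for this example, not taken from the PR or its benchmarks) where rewriting the shift pair into a single shift plus a constant mask lets a later constant mask combine with it by plain algebra, something the two-shift form does not expose as directly:

```java
// Illustration of follow-up folding enabled by the shift-to-mask strength reduction.
public class ShiftFoldingExample {
    // Original shape: two shifts followed by a constant mask.
    static int twoShifts(int x) {
        return ((x >> 3) << 8) & 0xF0;
    }

    // (x >> 3) << 8 is equivalent to (x << 5) & 0xFFFFFF00 (i.e. -1 << 8), so the whole
    // expression becomes (x << 5) & (0xFFFFFF00 & 0xF0) == (x << 5) & 0 == 0.
    static int afterRewrite(int x) {
        return 0;
    }

    public static void main(String[] args) {
        for (int x : new int[] { 0, 1, -1, 7, 8, 12345, Integer.MIN_VALUE, Integer.MAX_VALUE }) {
            if (twoShifts(x) != afterRewrite(x)) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("both shapes agree");
    }
}
```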
------------- PR: https://git.openjdk.org/jdk/pull/12734 From redestad at openjdk.org Thu Mar 9 23:22:06 2023 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 9 Mar 2023 23:22:06 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v3] In-Reply-To: <8j6FJE2RODz6nR8hzV3K7WboYWg8mR2q2zKPlU8Bts8=.8578d67b-ef6b-4962-9985-c42147e7f4c5@github.com> References: <8j6FJE2RODz6nR8hzV3K7WboYWg8mR2q2zKPlU8Bts8=.8578d67b-ef6b-4962-9985-c42147e7f4c5@github.com> Message-ID: On Thu, 9 Mar 2023 22:47:52 GMT, Jasmine K. wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Update comments in IR tests Can't say I'm very familiar with aarch64 (yet) but on my Mac M1 (osx-aarch64) I see similar improvements: Baseline Patch Improvement Benchmark Mode Cnt Score Error Units Score Error Units LShiftNodeIdealize.testShiftAndInt avgt 15 601.106 ? 18,668 ns/op / 432.041 ? 3.912 ns/op +39.13% LShiftNodeIdealize.testShiftAndLong avgt 15 588.143 ? 2.281 ns/op / 422.035 ? 5.384 ns/op +39.36% ------------- PR: https://git.openjdk.org/jdk/pull/12734 From duke at openjdk.org Fri Mar 10 00:53:15 2023 From: duke at openjdk.org (Jasmine K.) 
Date: Fri, 10 Mar 2023 00:53:15 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v3] In-Reply-To: <8j6FJE2RODz6nR8hzV3K7WboYWg8mR2q2zKPlU8Bts8=.8578d67b-ef6b-4962-9985-c42147e7f4c5@github.com> References: <8j6FJE2RODz6nR8hzV3K7WboYWg8mR2q2zKPlU8Bts8=.8578d67b-ef6b-4962-9985-c42147e7f4c5@github.com> Message-ID: On Thu, 9 Mar 2023 22:47:52 GMT, Jasmine K. wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Update comments in IR tests Nice! Glad to see the change has an impact there :) ------------- PR: https://git.openjdk.org/jdk/pull/12734 From duke at openjdk.org Fri Mar 10 01:10:03 2023 From: duke at openjdk.org (Jasmine K.) Date: Fri, 10 Mar 2023 01:10:03 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v4] In-Reply-To: References: Message-ID: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- > Hello, > I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. 
However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% > LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% > LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% > LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% > LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% > LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% > LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% > > > In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. > > Testing: GHA, tier1 local, and performance testing > > Thanks, > Jasmine K Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: Update full name ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12734/files - new: https://git.openjdk.org/jdk/pull/12734/files/9a6ff3c4..c4a5d237 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12734&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12734&range=02-03 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12734.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12734/head:pull/12734 PR: https://git.openjdk.org/jdk/pull/12734 From kvn at openjdk.org Fri Mar 10 01:19:07 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 10 Mar 2023 01:19:07 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v2] In-Reply-To: <8GoprkMChP1bYoO8PPNijDfhYgMtrD4UGuaHHTiRbZs=.12a42da9-b05f-497e-9c7e-616907aebe90@github.com> References: <8GoprkMChP1bYoO8PPNijDfhYgMtrD4UGuaHHTiRbZs=.12a42da9-b05f-497e-9c7e-616907aebe90@github.com> Message-ID: On Thu, 9 Mar 2023 06:22:56 GMT, David Holmes wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Address comments > > So are there plans to migrate to this new mechanism and remove those global flags? @dholmes-ora Do you have other questions? ------------- PR: https://git.openjdk.org/jdk/pull/12858 From dzhang at openjdk.org Fri Mar 10 02:15:14 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 10 Mar 2023 02:15:14 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v4] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot. 
> > This patch will add support of `VectorLoadMask` and vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > `VectorLoadMask` will generate the corresponding mask vector for the vector addition operation in mask form. > > AddMaskTestMerge case: > > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 # KILL cr > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c8109ae: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109b2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c8109b6: vsetivli t0,4,e8,m1,tu,mu > 0x000000400c8109ba: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c8109be: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109c2: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Unmatched: > - [x] Tier1 tests (release) > - [ ] Tier2 tests (release) Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains one commit: RISC-V: Support vector add mask instructions for Vector API ------------- Changes: https://git.openjdk.org/jdk/pull/12682/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=03 Stats: 655 lines in 6 files changed: 643 ins; 5 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dholmes at openjdk.org Fri Mar 10 02:33:12 2023 From: dholmes at openjdk.org (David Holmes) Date: Fri, 10 Mar 2023 02:33:12 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v2] In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 05:31:52 GMT, Vladimir Kozlov wrote: >> Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. >> We have *product* VM flags for most intrinsics and set them in VM based on HW support. >> But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. >> Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. >> >> I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). >> >> Additional fixes: >> Fixed Interpreter to skip intrinsics if they are disabled with flag. >> Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. >> Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. >> Added missing `native` mark to `_currentThread`. >> Removed unused `AbstractInterpreter::in_native_entry()`. >> Cleanup C2 intrinsic checks code. >> >> Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address comments No further questions :) Seems okay. Thanks. ------------- Marked as reviewed by dholmes (Reviewer). PR: https://git.openjdk.org/jdk/pull/12858 From dzhang at openjdk.org Fri Mar 10 02:47:44 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 10 Mar 2023 02:47:44 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v5] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot. > > This patch will add support of `VectorLoadMask` and vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > `VectorLoadMask` will generate the corresponding mask vector for the vector addition operation in mask form. 
> > AddMaskTestMerge case: > > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 # KILL cr > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c8109ae: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109b2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c8109b6: vsetivli t0,4,e8,m1,tu,mu > 0x000000400c8109ba: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c8109be: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109c2: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Unmatched: > - [x] Tier1 tests (release) > - [ ] Tier2 tests (release) Dingli Zhang has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: RISC-V: Support vector add mask instructions for Vector API ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/d0aaf9e8..59a15d59 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From kvn at openjdk.org Fri Mar 10 02:53:05 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 10 Mar 2023 02:53:05 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v2] In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 05:31:52 GMT, Vladimir Kozlov wrote: >> Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. >> We have *product* VM flags for most intrinsics and set them in VM based on HW support. >> But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. >> Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. >> >> I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). >> >> Additional fixes: >> Fixed Interpreter to skip intrinsics if they are disabled with flag. >> Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. >> Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. >> Added missing `native` mark to `_currentThread`. >> Removed unused `AbstractInterpreter::in_native_entry()`. >> Cleanup C2 intrinsic checks code. >> >> Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address comments Thank you, David. ------------- PR: https://git.openjdk.org/jdk/pull/12858 From thartmann at openjdk.org Fri Mar 10 07:36:13 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 10 Mar 2023 07:36:13 GMT Subject: RFR: 8303611: Null pointer dereference in block.hpp:586 (ID: 44856) In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 08:51:46 GMT, Emanuel Peter wrote: > Replaced `unique_ctrl_out_or_null` with `unique_ctrl_out`, which asserts if it finds `nullptr`. This is better than running into a `nullptr`-dereference inside `get_block_for_node`. > > This was found by a static code analyzer, so it is not clear that a `nullptr` dereference would ever happen. But let's still fix it. Looks good to me too but please verify that the static analysis now passes. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12919 From thartmann at openjdk.org Fri Mar 10 08:00:18 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 10 Mar 2023 08:00:18 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v4] In-Reply-To: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> References: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. ---------------------------------------------------------------------- On Fri, 10 Mar 2023 01:10:03 GMT, Jasmine K. wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Update full name Thanks for updating the comments and the additional details. Looks good to me! ------------- PR: https://git.openjdk.org/jdk/pull/12734 From duke at openjdk.org Fri Mar 10 10:11:04 2023 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 10 Mar 2023 10:11:04 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation Message-ID: It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. 
There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache at `Compilation::emit_code_epilog` and when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to` if there is a call to `CodeCache::commit` later on. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation respectively. This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3568 to 3487 on C1 (2.27% improvement) and from 3499 to 3415 on C2 (2.40% improvement). ------------- Commit messages: - JDK-8303154: fix syntax and add comment - JDK-8303154: fix syntax - JDK-8303154: remove unnecessary icache flushing at the end emit_code_epilog - JDK-8303154: Investigate and improve instruction cache flushing during compilation Changes: https://git.openjdk.org/jdk/pull/12877/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12877&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303154 Stats: 15 lines in 5 files changed: 4 ins; 3 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/12877.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12877/head:pull/12877 PR: https://git.openjdk.org/jdk/pull/12877 From thartmann at openjdk.org Fri Mar 10 10:11:05 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 10 Mar 2023 10:11:05 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 08:37:50 GMT, Damon Fenacci wrote: > It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. > There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). > > This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache at `Compilation::emit_code_epilog` and when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to` if there is a call to `CodeCache::commit` later on. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation respectively. > > This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3568 to 3487 on C1 (2.27% improvement) and from 3499 to 3415 on C2 (2.40% improvement). Looks good to me. As we discussed, please file a follow-up RFE for the remaining investigations around excessive icache flushing. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12877 From jvernee at openjdk.org Fri Mar 10 14:14:55 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Fri, 10 Mar 2023 14:14:55 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v3] In-Reply-To: References: Message-ID: > The issue is that the size of the code buffer is not large enough to hold the whole stub. > > Proposed solution is to scale the size of the stub with the number of arguments. I've adjusted sizes for both downcall and upcall stubs. I've also dropped the number of relocations, since we're not really using any for downcalls, and for upcalls we only have 1 AFAICS. (the size of the relocations can not be zero however, as that leads to the relocation section [not being initialized][1], and triggering [an assert][2] later when the code blob is copied). > > The way I've determined the new base size and per-argument size for stubs, is by first linking a stub without any arguments to get the required base size, and by then adding 20 `double` arguments to get a rough per-argument size. Both values have wiggle room as well. The sizes can be printed using e.g. `-XX:+LogCompilation`, and then looking for `nep_invoker_blob` and `upcall_stub*` in the log file. This experiment was done on a fastdebug build to account for additional debug code being generated. The included test is designed to try and maximize the size of the generated stub. > > I've also updated `CodeBuffer::log_section_sizes` to print the in-use size, rather than just the capacity and free space. > > [1]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L119-L121 > [2]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L675 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: RISCV changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12908/files - new: https://git.openjdk.org/jdk/pull/12908/files/0a2bc96c..7f467784 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12908&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12908&range=01-02 Stats: 14 lines in 2 files changed: 8 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/12908.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12908/head:pull/12908 PR: https://git.openjdk.org/jdk/pull/12908 From jvernee at openjdk.org Fri Mar 10 14:14:56 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Fri, 10 Mar 2023 14:14:56 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v2] In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 06:39:47 GMT, Feilong Jiang wrote: >> @vnkozlov Yes, this is true. The only other existing port of this code is RISCV. However, to fix that port properly, someone needs to repeat the experiment on RISCV in order to figure out what the base size and the size per argument should be. >> >> I don't have access to a RISCV machine, so I figured I would file a followup issue for the RISCV maintainers to fix separately. >> >> @feilongjiang Could you comment on this? If you could figure out the needed sizes for RISCV I could add the needed changes to this patch. Otherwise I could file a followup issue if that seems more convenient. TIA > >> @feilongjiang Could you comment on this? 
If you could figure out the needed sizes for RISCV I could add the needed changes to this patch. Otherwise I could file a followup issue if that seems more convenient. TIA > > Yes, I will take a look to find out the needed size for RISCV. > > Update: > When disabling RVC (compressed instructions) on fastdebug build, `LogCompilation` reveals that downcall stub base will cost ~200 bytes, 256 looks good enough. But for upcall stubs, we need ~1700 bytes when Shenandoah GC is enabled, so 2048 would be a safe base size. `jdk_foreign` on RISC-V board are all passed (release & fastdebug) with the fix of #12950. > > Here is the patch: > [riscv.txt](https://github.com/openjdk/jdk/files/10938297/riscv.txt) @feilongjiang Thanks! I've added the riscv changes. ------------- PR: https://git.openjdk.org/jdk/pull/12908 From epeter at openjdk.org Fri Mar 10 14:30:15 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 10 Mar 2023 14:30:15 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v4] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 08:19:05 GMT, Daniel Skantz wrote: >> We are lifting some loopopts/superword tests to use the IR framework, and add IR annotations to check that vector reductions take place on x86_64. This can be useful to prevent issues such as JDK-8300865. >> >> Approach: lift the more general tests in loopopts/superword, mainly using matching rules in cpu/x86/x86.ad, but leave tests mostly unchanged otherwise. Some reductions are considered non-profitable (superword.cpp), so we might need to raise sse/avx value pre-conditions from what would be a strict reading of x86.ad (as noted by @eme64). >> >> Testing: Local testing (x86_64) using UseSSE={2,3,4}, UseAVX={0,1,2,3}. Tested running all jtreg compiler tests. Tier1-tier5 runs to my knowledge never showed any compiler-related regression in other tests as a result from this work. GHA. Validation: all tests fail if we put unreasonable counts for the respective reduction node, such as counts = {IRNode.ADD_REDUCTION_VI, ">= 10000000"}). >> >> Thanks @robcasloz and @eme64 for advice. >> >> Notes: ProdRed_Double does not vectorize (JDK-8300865). SumRed_Long does not vectorize on 32-bit, according to my reading of source, test on GHA and cross-compiled JDK on 32-bit Linux, so removed these platforms from @requires. Lifted the AbsNeg tests too but added no checks, as these are currently not run on x86_64. > > Daniel Skantz has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 14 additional commits since the last revision: > > - Merge branch 'master' of github.com:openjdk/jdk into JDK-8294715-IR-new > - Correctly reset totals in RedTest*; put debug print msgs in exception > - Remove 2-unroll scenario > - remove duped unrolllimit > - fix typo; remove non-store case from SumRedSqrt_Double due to slow run time > - Merge branch 'master' of github.com:openjdk/jdk into JDK-8294715-IR-new > - Remove print statements for prints that were silenced by IR framework addition > - Remove non-double stores > - Revert much of last commit, and part of the first commit addressing review comments : intention is to remove all the negative tests, except for on -XX:-SuperWordReductions. Keep some comments and additional IR nodes added to existing checks. > - Address further review comments (edits) > - ... 
and 4 more: https://git.openjdk.org/jdk/compare/cb20d6e4...ec02160d I know it took a while to complete, but we really do need more solid testing for SuperWord. @danielogh Thanks for the work! ------------- Marked as reviewed by epeter (Committer). PR: https://git.openjdk.org/jdk/pull/12683 From duke at openjdk.org Fri Mar 10 14:34:14 2023 From: duke at openjdk.org (Daniel Skantz) Date: Fri, 10 Mar 2023 14:34:14 GMT Subject: RFR: 8294715: Add IR checks to the reduction vectorization tests [v4] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 12:40:43 GMT, Roberto Casta?eda Lozano wrote: >> Daniel Skantz has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 14 additional commits since the last revision: >> >> - Merge branch 'master' of github.com:openjdk/jdk into JDK-8294715-IR-new >> - Correctly reset totals in RedTest*; put debug print msgs in exception >> - Remove 2-unroll scenario >> - remove duped unrolllimit >> - fix typo; remove non-store case from SumRedSqrt_Double due to slow run time >> - Merge branch 'master' of github.com:openjdk/jdk into JDK-8294715-IR-new >> - Remove print statements for prints that were silenced by IR framework addition >> - Remove non-double stores >> - Revert much of last commit, and part of the first commit addressing review comments : intention is to remove all the negative tests, except for on -XX:-SuperWordReductions. Keep some comments and additional IR nodes added to existing checks. >> - Address further review comments (edits) >> - ... and 4 more: https://git.openjdk.org/jdk/compare/3367fb96...ec02160d > > Looks good. Thanks again for the thorough and meticulous work, Daniel! > > All `@IR` checks are either trivially negative (`@IR(applyIf = {"SuperWordReductions", "false"}, failOn = ...)`) or guarded by x86-specific features, so these changes should not cause false failures for non-x86 architectures anymore. Thank you @robcasloz, @eme64 and @fg1417 for all the help and reviews! ------------- PR: https://git.openjdk.org/jdk/pull/12683 From yyang at openjdk.org Fri Mar 10 14:43:26 2023 From: yyang at openjdk.org (Yi Yang) Date: Fri, 10 Mar 2023 14:43:26 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If Message-ID: Hi, can I have a review for this patch? It adds new Identity for BoolNode to lookup homogenous back-to-back Ifs, i.e. `Bool (CmpX a b)` is identity to `Bool (CmpX b a)`, in this way, we are able to merge two "identical" Ifs, which is not before. 
public static void test(int a, int b) { // ok, identical ifs, apply split_if if (a == b) { int_field = 0x42; } else { int_field = 42; } if (a == b) { int_field = 0x42; } else { int_field = 42; } } public static void test(int a, int b) { // do nothing if (a == b) { int_field = 0x42; } else { int_field = 42; } if (b == a) { int_field = 0x42; } else { int_field = 42; } } ------------- Commit messages: - 8303970 C2 can not merge homogeneous adjacent two If Changes: https://git.openjdk.org/jdk/pull/12978/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12978&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303970 Stats: 72 lines in 3 files changed: 65 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/12978.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12978/head:pull/12978 PR: https://git.openjdk.org/jdk/pull/12978 From redestad at openjdk.org Fri Mar 10 15:24:08 2023 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 10 Mar 2023 15:24:08 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v4] In-Reply-To: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> References: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> Message-ID: On Fri, 10 Mar 2023 01:10:03 GMT, Jasmine K. wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Update full name Marked as reviewed by redestad (Reviewer). 
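As an illustration of the two rewrite rules quoted in the description above (the example is added here for clarity and is not taken from the PR or its benchmarks), the bit-level equivalences look like this in Java; only the equivalences are claimed here, not the exact IR that C2 ends up emitting:

```java
// C1 == C2 (the already-handled case): (x >> 3) << 3 simply clears the low
// three bits, i.e. it equals x & (-1 << 3) == x & 0xFFFF_FFF8.
static int clearLow3(int x) {
    return (x >> 3) << 3;
}

// C1 != C2 with an intervening mask (the generalized case): move byte 1 of x
// into byte 3. ((x >> 8) & 0xFF) << 24 equals (x << 16) & 0xFF00_0000, so the
// two shifts plus an and can be strength-reduced to one shift plus an and.
static int byteOneToTop(int x) {
    return ((x >> 8) & 0xFF) << 24;
}
```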
------------- PR: https://git.openjdk.org/jdk/pull/12734 From thartmann at openjdk.org Fri Mar 10 15:29:13 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 10 Mar 2023 15:29:13 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 14:37:06 GMT, Yi Yang wrote: > Hi, can I have a review for this patch? It adds new Identity for BoolNode to lookup homogenous integer comparison, i.e. `Bool (CmpX a b)` is identity to `Bool (CmpX b a)`, in this way, we are able to merge the following two "identical" Ifs, which is not before. > > > public static void test(int a, int b) { // ok, identical ifs, apply split_if > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > } > > public static void test(int a, int b) { // do nothing > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (b == a) { > int_field = 0x42; > } else { > int_field = 42; > } > } I executed some quick testing and `applications/ctw/modules/jdk_internal_le.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/workspace/open/src/hotspot/share/opto/loopnode.hpp:1136), pid=1175080, tid=1175153 # assert(n != nullptr) failed: Bad immediate dominator info. # # JRE version: Java(TM) SE Runtime Environment (21.0) (fastdebug build 21-internal-LTS-2023-03-10-1459266.tobias.hartmann.jdk2) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 21-internal-LTS-2023-03-10-1459266.tobias.hartmann.jdk2, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x14aa3ab] PhaseIdealLoop::idom_no_update(unsigned int) const+0x17b Current CompileTask: C2: 5953 1184 b jdk.internal.org.jline.reader.impl.LineReaderImpl::viYankTo (56 bytes) Stack: [0x00007f50104c5000,0x00007f50105c6000], sp=0x00007f50105bffa0, free space=1003k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x14aa3ab] PhaseIdealLoop::idom_no_update(unsigned int) const+0x17b (loopnode.hpp:1136) V [libjvm.so+0x14e7f6f] PhaseIdealLoop::split_if_with_blocks_post(Node*)+0x32f (loopnode.hpp:1149) V [libjvm.so+0x14e8869] PhaseIdealLoop::split_if_with_blocks(VectorSet&, Node_Stack&)+0x209 (loopopts.cpp:1828) V [libjvm.so+0x14d9aff] PhaseIdealLoop::build_and_optimize()+0x12df (loopnode.cpp:4520) V [libjvm.so+0xb2f661] PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x261 (loopnode.hpp:1111) V [libjvm.so+0xb2ae1f] Compile::Optimize()+0xe2f (compile.cpp:2149) V [libjvm.so+0xb2d47a] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x191a (compile.cpp:833) V [libjvm.so+0x939857] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x4e7 (c2compiler.cpp:113) V [libjvm.so+0xb3ad6c] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xa7c (compileBroker.cpp:2265) V [libjvm.so+0xb3bbe0] CompileBroker::compiler_thread_loop()+0x690 (compileBroker.cpp:1944) V [libjvm.so+0x108d796] JavaThread::thread_main_inner()+0x206 (javaThread.cpp:710) V [libjvm.so+0x1a92900] Thread::call_run()+0x100 (thread.cpp:224) V [libjvm.so+0x1730fc3] thread_native_entry(Thread*)+0x103 (os_linux.cpp:740) Which makes sense because your optimization does not respect dominance, right? ------------- Changes requested by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12978 From thartmann at openjdk.org Fri Mar 10 15:34:10 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 10 Mar 2023 15:34:10 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 14:37:06 GMT, Yi Yang wrote: > Hi, can I have a review for this patch? It adds new Identity for BoolNode to lookup homogenous integer comparison, i.e. `Bool (CmpX a b)` is identity to `Bool (CmpX b a)`, in this way, we are able to merge the following two "identical" Ifs, which is not before. > > > public static void test(int a, int b) { // ok, identical ifs, apply split_if > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > } > > public static void test(int a, int b) { // do nothing > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (b == a) { > int_field = 0x42; > } else { > int_field = 42; > } > } You probably need something similar to this: https://github.com/openjdk/jdk/blob/5726d31e56530bbe7dee61ae04b126e20cb3611d/src/hotspot/share/opto/graphKit.cpp#L1323-L1327 ------------- PR: https://git.openjdk.org/jdk/pull/12978 From tholenstein at openjdk.org Fri Mar 10 15:43:26 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 10 Mar 2023 15:43:26 GMT Subject: Integrated: 8300821: UB: Applying non-zero offset to non-null pointer 0xfffffffffffffffe produced null pointer In-Reply-To: <3li-TpWHA4E_MnXnSXCI6AXju4o6rBSw0Ey2J6fXOKM=.f05f51f5-3add-4b11-bc34-efeae0e39b7c@github.com> References: <3li-TpWHA4E_MnXnSXCI6AXju4o6rBSw0Ey2J6fXOKM=.f05f51f5-3add-4b11-bc34-efeae0e39b7c@github.com> Message-ID: On Fri, 3 Mar 2023 14:46:51 GMT, Tobias Holenstein wrote: > "UndefinedBehaviorSanitizer" (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html) in Xcode running on `java --version` discovered an Undefined Behavior. The reason is in the `next()` method https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/asm/codeBuffer.cpp#L798 > > In ``RelocIterator::next()`` we get a nullpointer after `_current++` > https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/code/relocInfo.hpp#L612 > But this is actually expected: In the constructor of the iterator `RelocIterator::RelocIterator` we have > ```c++ > _current = cs->locs_start()-1; > _end = cs->locs_end(); > > and in our case locs_start() and locs_end() are `null` - so `_current` is `null`-1. After `_current++` both `_end` and `_current` are `null`. Just after `_current++` we then check if `_current == _end` and return `false` (there is no next reloc info) > > ## Solution > We want to be able to turn on "UndefinedBehaviorSanitizer" and don't have false positives. So we add a check > `cs->has_locs()` and only create the iterator if we have reloc info. > > Also added a sanity check in `RelocIterator::RelocIterator` that checks that either both `_current` and `_end` are null or both are not null. This pull request has now been integrated. 
Changeset: 01312a00 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/01312a002ba27bfbfebb9fde484ca34ebde0704c Stats: 4 lines in 2 files changed: 1 ins; 0 del; 3 mod 8300821: UB: Applying non-zero offset to non-null pointer 0xfffffffffffffffe produced null pointer Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12854 From tholenstein at openjdk.org Fri Mar 10 15:43:25 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 10 Mar 2023 15:43:25 GMT Subject: RFR: 8300821: UB: Applying non-zero offset to non-null pointer 0xfffffffffffffffe produced null pointer In-Reply-To: References: <3li-TpWHA4E_MnXnSXCI6AXju4o6rBSw0Ey2J6fXOKM=.f05f51f5-3add-4b11-bc34-efeae0e39b7c@github.com> Message-ID: On Wed, 8 Mar 2023 00:53:38 GMT, Vladimir Kozlov wrote: >> "UndefinedBehaviorSanitizer" (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html) in Xcode running on `java --version` discovered an Undefined Behavior. The reason is in the `next()` method https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/asm/codeBuffer.cpp#L798 >> >> In ``RelocIterator::next()`` we get a nullpointer after `_current++` >> https://github.com/openjdk/jdk/blob/040f5b55bd03bcc2209ece6eebf223ba1fabf824/src/hotspot/share/code/relocInfo.hpp#L612 >> But this is actually expected: In the constructor of the iterator `RelocIterator::RelocIterator` we have >> ```c++ >> _current = cs->locs_start()-1; >> _end = cs->locs_end(); >> >> and in our case locs_start() and locs_end() are `null` - so `_current` is `null`-1. After `_current++` both `_end` and `_current` are `null`. Just after `_current++` we then check if `_current == _end` and return `false` (there is no next reloc info) >> >> ## Solution >> We want to be able to turn on "UndefinedBehaviorSanitizer" and don't have false positives. So we add a check >> `cs->has_locs()` and only create the iterator if we have reloc info. >> >> Also added a sanity check in `RelocIterator::RelocIterator` that checks that either both `_current` and `_end` are null or both are not null. > > Good. thanks @vnkozlov and @TobiHartmann for the review! ------------- PR: https://git.openjdk.org/jdk/pull/12854 From thartmann at openjdk.org Fri Mar 10 15:46:12 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 10 Mar 2023 15:46:12 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If In-Reply-To: References: Message-ID: <5LpP6n-yoQGTZE4DqIUhOJ8bRBcK87xRnDiTFHJjwiU=.80bb832a-498e-4edd-acb7-f42bac7092cb@github.com> On Fri, 10 Mar 2023 14:37:06 GMT, Yi Yang wrote: > Hi, can I have a review for this patch? It adds new Identity for BoolNode to lookup homogenous integer comparison, i.e. `Bool (CmpX a b)` is identity to `Bool (CmpX b a)`, in this way, we are able to merge the following two "identical" Ifs, which is not before. > > > public static void test(int a, int b) { // ok, identical ifs, apply split_if > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > } > > public static void test(int a, int b) { // do nothing > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (b == a) { > int_field = 0x42; > } else { > int_field = 42; > } > } On second thought, since the same values are compared, dominance should not be an issue. 
Must be something else then :) ------------- PR: https://git.openjdk.org/jdk/pull/12978 From duke at openjdk.org Fri Mar 10 16:11:08 2023 From: duke at openjdk.org (Jasmine K.) Date: Fri, 10 Mar 2023 16:11:08 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v4] In-Reply-To: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> References: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> Message-ID: On Fri, 10 Mar 2023 01:10:03 GMT, Jasmine K. wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Update full name Thanks for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/12734 From qamai at openjdk.org Fri Mar 10 16:11:13 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 10 Mar 2023 16:11:13 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 14:37:06 GMT, Yi Yang wrote: > Hi, can I have a review for this patch? It adds new Identity for BoolNode to lookup homogenous integer comparison, i.e. `Bool (CmpX a b)` is identity to `Bool (CmpX b a)`, in this way, we are able to merge the following two "identical" Ifs, which is not before. 
> > > public static void test(int a, int b) { // ok, identical ifs, apply split_if > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > } > > public static void test(int a, int b) { // do nothing > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (b == a) { > int_field = 0x42; > } else { > int_field = 42; > } > } May I ask why is global value numbering not able to merge these 2? Can we modify `BoolNode::hash` and `BoolNode::cmp` to take these into consideration? Thanks a lot. ------------- PR: https://git.openjdk.org/jdk/pull/12978 From redestad at openjdk.org Fri Mar 10 16:23:15 2023 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 10 Mar 2023 16:23:15 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v4] In-Reply-To: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> References: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> Message-ID: <-_wbvP9zOlKW-tlbbBYmQZZwvXtYDD6_eDdaIMjKnYY=.5e5f9808-8941-4982-b789-2ac7401168af@github.com> On Fri, 10 Mar 2023 01:10:03 GMT, Jasmine K. wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Update full name I've started an internal testing run (tier 1-3) and will report any issues or sponsor depending on the results. 
The test that's failing in GHA is an unrelated bug that was supposedly fixed yesterday: https://bugs.openjdk.org/browse/JDK-8303105 ------------- PR: https://git.openjdk.org/jdk/pull/12734 From yyang at openjdk.org Fri Mar 10 16:58:12 2023 From: yyang at openjdk.org (Yi Yang) Date: Fri, 10 Mar 2023 16:58:12 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If In-Reply-To: References: Message-ID: <1kwsyPbm1yzEIn4y3WUsurt9lXWuUs-08ps8iWFXNBk=.2368f3ed-e6fe-4566-80c7-530862ea9085@github.com> On Fri, 10 Mar 2023 16:08:23 GMT, Quan Anh Mai wrote: > May I ask why is global value numbering not able to merge these 2? Can we modify `BoolNode::hash` and `BoolNode::cmp` to take these into consideration? > Thanks a lot. Hi @merykitty I do considered using GVN since it is a good candidate for that in theory. But I think they have [different input](https://github.com/openjdk/jdk/blob/c26e1d0148de27d0b257ec10380a5c50483fd3c0/src/hotspot/share/opto/phaseX.cpp#L119) and would collide eventually in practice. On the other hand, Identity is more common that GVN, so I try to find existing bool node by its Identity. ------------- PR: https://git.openjdk.org/jdk/pull/12978 From duke at openjdk.org Fri Mar 10 20:08:25 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Fri, 10 Mar 2023 20:08:25 GMT Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. Message-ID: Changing RuntimeException to SkippedException when TieredStopAtLevel < 4. The main part of this problem was done in [JDK-8226795](https://bugs.openjdk.org/browse/JDK-8226795) but for some reason, the exception was not changed. ------------- Commit messages: - 8299226: Throw SkippedException instead of RuntimeException in case of TieredStopAtLevel < 4 Changes: https://git.openjdk.org/jdk/pull/12981/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12981&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299226 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12981.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12981/head:pull/12981 PR: https://git.openjdk.org/jdk/pull/12981 From kvn at openjdk.org Fri Mar 10 21:20:39 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 10 Mar 2023 21:20:39 GMT Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. In-Reply-To: References: Message-ID: <4szUzELeZvvinROhWc3CpmcvlowRDiZlVy0KKBj_FCE=.2be8ee8a-9d70-4cde-8325-c1eac472e72c@github.com> On Fri, 10 Mar 2023 19:31:45 GMT, Ilya Korennoy wrote: > Changing RuntimeException to SkippedException when TieredStopAtLevel < 4. > > The main part of this problem was done in [JDK-8226795](https://bugs.openjdk.org/browse/JDK-8226795) but for some reason, the exception was not changed. Did you tried to reproduce the issue described in the bug? JDK-8226795 fix added jtreg's commands `@requires` which will not allow to run without C2. I tried: Test results: no tests selected Also `Platform` and `TIERED_*` checks in `main()` methods are useless and can be removed after JDK-8226795 changes. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12981 From duke at openjdk.org Fri Mar 10 22:08:46 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Fri, 10 Mar 2023 22:08:46 GMT Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. 
In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 19:31:45 GMT, Ilya Korennoy wrote: > Changing RuntimeException to SkippedException when TieredStopAtLevel < 4. > > The main part of this problem was done in [JDK-8226795](https://bugs.openjdk.org/browse/JDK-8226795) but for some reason, the exception was not changed. Yes, I tried to reproduce the issue, it didn't reproduce for me either. Initially, I thought about removing these checks, but then I decided that there would be less diff and the code would be the same as in other tests that use similar checks. I am not familiar with OpenJDK development processes, but if it's possible within this PR, I can remove the checks code from this test and from other similar tests: Level2RecompilationTest.java OSRFailureLevel4Test.java. ------------- PR: https://git.openjdk.org/jdk/pull/12981 From duke at openjdk.org Fri Mar 10 22:36:23 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Fri, 10 Mar 2023 22:36:23 GMT Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 19:31:45 GMT, Ilya Korennoy wrote: > Changing RuntimeException to SkippedException when TieredStopAtLevel < 4. > > The main part of this problem was done in [JDK-8226795](https://bugs.openjdk.org/browse/JDK-8226795) but for some reason, the exception was not changed. I looked again at the Level2RecompilationTest and OSRFailureLevel4Test and it seems that I was wrong in these tests the checks in `main()` are different. So, it only needs to remove the checks from TestTypeProfiling. ------------- PR: https://git.openjdk.org/jdk/pull/12981 From kvn at openjdk.org Fri Mar 10 22:49:35 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 10 Mar 2023 22:49:35 GMT Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 22:33:19 GMT, Ilya Korennoy wrote: >> Changing RuntimeException to SkippedException when TieredStopAtLevel < 4. >> >> The main part of this problem was done in [JDK-8226795](https://bugs.openjdk.org/browse/JDK-8226795) but for some reason, the exception was not changed. > > I looked again at the Level2RecompilationTest and OSRFailureLevel4Test and it seems that I was wrong in these tests the checks in `main()` are different. > > So, it only needs to remove the checks from TestTypeProfiling. @ikorennoy, I added comment with question to Evgeny about how he hit the issue so we can reproduce it. These tests can't be run without JTREG which filter them. Based on his answer we either close bug as not issue or try to find why filtering does not work in his configuration. If you want to look to do clean to remove unneeded checks and simplify `@requires` I would suggest to file a separate RFE. ------------- PR: https://git.openjdk.org/jdk/pull/12981 From kvn at openjdk.org Fri Mar 10 22:58:00 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 10 Mar 2023 22:58:00 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 08:37:50 GMT, Damon Fenacci wrote: > It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. > There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). 
> > This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache at `Compilation::emit_code_epilog` and when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to` if there is a call to `CodeCache::commit` later on. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation respectively. > > This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3568 to 3487 on C1 (2.27% improvement) and from 3499 to 3415 on C2 (2.40% improvement). Did you look on how many times we flush ICache during adapters generation? It has most numerous cases when I looked on it: "CodeCache::commit() is also used for adapters. But adapters uses RuntimeBlob which calls CodeBuffer::copy_code_to()." I thought we would remove flush from CodeCache::commit() and not from copy_code_to(). ------------- PR: https://git.openjdk.org/jdk/pull/12877 From kvn at openjdk.org Sat Mar 11 00:18:22 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 11 Mar 2023 00:18:22 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v3] In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 14:14:55 GMT, Jorn Vernee wrote: >> The issue is that the size of the code buffer is not large enough to hold the whole stub. >> >> Proposed solution is to scale the size of the stub with the number of arguments. I've adjusted sizes for both downcall and upcall stubs. I've also dropped the number of relocations, since we're not really using any for downcalls, and for upcalls we only have 1 AFAICS. (the size of the relocations can not be zero however, as that leads to the relocation section [not being initialized][1], and triggering [an assert][2] later when the code blob is copied). >> >> The way I've determined the new base size and per-argument size for stubs, is by first linking a stub without any arguments to get the required base size, and by then adding 20 `double` arguments to get a rough per-argument size. Both values have wiggle room as well. The sizes can be printed using e.g. `-XX:+LogCompilation`, and then looking for `nep_invoker_blob` and `upcall_stub*` in the log file. This experiment was done on a fastdebug build to account for additional debug code being generated. The included test is designed to try and maximize the size of the generated stub. >> >> I've also updated `CodeBuffer::log_section_sizes` to print the in-use size, rather than just the capacity and free space. >> >> [1]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L119-L121 >> [2]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L675 > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > RISCV changes Good. ------------- Marked as reviewed by kvn (Reviewer). 
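To make the "scale the stub size with the number of arguments" discussion above concrete, here is a rough sketch (not the regression test from the PR; the class name and argument count are chosen for illustration) of the kind of linkage request that produces a large downcall stub. With the FFM API in its preview form at the time it needs `--enable-preview`, and linking the handle is what generates the downcall stub (the `nep_invoker_blob` mentioned above):

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemoryLayout;
import java.lang.invoke.MethodHandle;
import java.util.Arrays;
import static java.lang.foreign.ValueLayout.JAVA_DOUBLE;

public class ManyArgDowncall {
    public static void main(String[] args) {
        // 20 double arguments, mirroring the sizing experiment described above.
        MemoryLayout[] argLayouts = new MemoryLayout[20];
        Arrays.fill(argLayouts, JAVA_DOUBLE);
        FunctionDescriptor fd = FunctionDescriptor.of(JAVA_DOUBLE, argLayouts);
        // Linking an unbound downcall handle is enough to generate the stub.
        MethodHandle handle = Linker.nativeLinker().downcallHandle(fd);
        System.out.println(handle.type());
    }
}
```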
PR: https://git.openjdk.org/jdk/pull/12908 From fjiang at openjdk.org Sat Mar 11 02:54:12 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Sat, 11 Mar 2023 02:54:12 GMT Subject: RFR: 8303955: RISC-V: Factor out the tmp parameter from copy_memory and copy_memory_v Message-ID: The call site of `copy_memory` and `copy_memory_v` always use `t0` as tmp register, so we can factor the tmp parameter out. Testing: - [x] tier1 tests on Unmatched board (release build with `-XX:-UseRVV`) - [x] hotspot_tier1 and jdk_tier1 on QEMU (release build with `-XX:+UseRVV`) ------------- Commit messages: - Factor out the tmp parameter from copy_memory and copy_memory_v Changes: https://git.openjdk.org/jdk/pull/12969/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12969&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303955 Stats: 29 lines in 1 file changed: 0 ins; 0 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/12969.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12969/head:pull/12969 PR: https://git.openjdk.org/jdk/pull/12969 From jbhateja at openjdk.org Sat Mar 11 16:50:32 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 11 Mar 2023 16:50:32 GMT Subject: RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v26] In-Reply-To: <_ygbSM3ARi21Y5jfOqEIYEqZe84cwTbQ6vayG9hVYiY=.36ace32c-73ac-452e-a9ac-f9ffaea08931@github.com> References: <_ygbSM3ARi21Y5jfOqEIYEqZe84cwTbQ6vayG9hVYiY=.36ace32c-73ac-452e-a9ac-f9ffaea08931@github.com> Message-ID: On Thu, 9 Mar 2023 21:04:57 GMT, Emanuel Peter wrote: >> **List of important things below** >> >> - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 >> - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 >> - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord >> - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 >> - Explanation of my new `find_dependency`. Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 >> >> **Original RFE description:** >> Cyclic dependencies are not handled correctly in all cases. Three examples: >> >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 >> >> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 >> >> And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 >> >> All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. >> >> **Analysis** >> >> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: >> >> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. 
When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. >> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). >> >> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. >> >> I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. >> **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! >> >> **Solution** >> >> First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. >> >> I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. >> >> Another change I have made: >> Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. >> >> **Testing** >> >> I added a few more regression tests, and am running tier1-3, plus some stress testing. >> >> However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. >> **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. >> >> **Discussion / Future Work** >> >> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? 
[JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 40 commits: > > - Merge master after NULL -> nullptr conversion > - Fixed wording from last commit > - A little renaming and improved comments > - resolve merge conflict after Roland's fix > - TestDependencyOffsets.java: add vanilla run > - TestDependencyOffsets.java: parallelize it + various AVX settings > - TestOptionVectorizeIR.java: removed PopulateIndex IR rule - fails on x86 32bit - see Matcher::match_rule_supported > - Merge branch 'master' into JDK-8298935 > - Reworked TestDependencyOffsets.java > - remove negative IR rules for TestOptionVectorizeIR.java > - ... and 30 more: https://git.openjdk.org/jdk/compare/5726d31e...731cc7b5 Thanks @eme64 for your very informative and detailed explanations. ------------- Marked as reviewed by jbhateja (Reviewer). PR: https://git.openjdk.org/jdk/pull/12350 From fyang at openjdk.org Mon Mar 13 01:15:22 2023 From: fyang at openjdk.org (Fei Yang) Date: Mon, 13 Mar 2023 01:15:22 GMT Subject: RFR: 8303955: RISC-V: Factor out the tmp parameter from copy_memory and copy_memory_v In-Reply-To: References: Message-ID: <8HOCjcd_jg5ToRgvOQ9pWS767BatJxZqgBBTqx-jYN4=.b8f737dc-956a-4a31-b03c-f1e895ea4e04@github.com> On Fri, 10 Mar 2023 09:12:25 GMT, Feilong Jiang wrote: > The call site of `copy_memory` and `copy_memory_v` always use `t0` as tmp register, so we can factor the tmp parameter out. > > Testing: > > - [x] tier1 tests on Unmatched board (release build with `-XX:-UseRVV`) > - [x] hotspot_tier1 and jdk_tier1 on QEMU (release build with `-XX:+UseRVV`) Looks fine. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/12969 From eliu at openjdk.org Mon Mar 13 02:28:20 2023 From: eliu at openjdk.org (Eric Liu) Date: Mon, 13 Mar 2023 02:28:20 GMT Subject: RFR: 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 10:56:40 GMT, Bhavana Kilambi wrote: > The cast operation for VectorMask from wider type to narrow type returns incorrect result for trueCount() method invocation for the resultant mask with SVE (on some SVE machines toLong() also results in incorrect values). An example narrow operation which results in incorrect toLong() and trueCount() values is shown below for a 128-bit -> 64-bit conversion and this can be extended to other narrow operations where the source mask in bytes is either 4x or 8x the size of the result mask in bytes - > > > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static long narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).toLong(); > } > > public static void main(String[] args) { > long r = 0L; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("toLong() : " + r); > } > } > > > **C2 compilation result :** > java --add-modules jdk.incubator.vector TestMaskCast > toLong(): 15 > > **Interpreter result (for verification) :** > java --add-modules jdk.incubator.vector -Xint TestMaskCast > toLong(): 3 > > The incorrect results with toLong() have been observed only on the 128-bit and 256-bit SVE machines but they are not reproducible on a 512-bit machine. 
However, trueCount() returns incorrect values too and they are reproducible on all the SVE machines and thus is more reliable to use trueCount() to bring out the drawbacks of the current implementation of mask cast narrow operation for SVE. > > Replacing the call to toLong() by trueCount() in the above example - > > > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static int narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).trueCount(); > } > > public static void main(String[] args) { > int r = 0; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("trueCount() : " + r); > } > } > > > > **C2 compilation result:** > java --add-modules jdk.incubator.vector TestMaskCast > trueCount() : 4 > > **Interpreter result:** > java --add-modules jdk.incubator.vector -Xint TestMaskCast > trueCount() : 2 > > Since in this example, the source mask size in bytes is 2x that of the result mask, trueCount() returns 2x the number of true elements in the source mask. It would return 4x/8x the number of true elements in the source mask if the size of the source mask is 4x/8x that of result mask. > > The returned values are incorrect because of the higher order bits in the result not being cleared (since the result is narrowed down) and trueCount() or toLong() tend to consider the higher order bits in the vector register as well which results in incorrect value. For the 128-bit to 64-bit conversion with a mask - "TT" passed, the current implementation for mask cast narrow operation returns the same mask in the lower and upper half of the 128-bit register that is - "TTTT" which results in a long value of 15 (instead of 3 - "FFTT" for the 64-bit Integer mask) and number of true elements to be 4 (instead of 2). > > This patch proposes a fix for this problem. An already existing JTREG IR test - "test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java" has also been modified to call the trueCount() method as well since the toString() method alone cannot be used to reproduce the incorrect values in this bug. This test passes successfully on 128-bit, 256-bit and 512-bit SVE machines. Since the IR test has been changed, it has been tested successfully on other platforms like x86 and aarch64 Neon machines as well to ensure the changes have not introduced any new errors. Already reviewed internally. ------------- Marked as reviewed by eliu (Committer). PR: https://git.openjdk.org/jdk/pull/12901 From yyang at openjdk.org Mon Mar 13 06:34:03 2023 From: yyang at openjdk.org (Yi Yang) Date: Mon, 13 Mar 2023 06:34:03 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If [v2] In-Reply-To: References: Message-ID: > Hi, can I have a review for this patch? It adds new Identity for BoolNode to lookup homogenous integer comparison, i.e. `Bool (CmpX a b)` is identity to `Bool (CmpX b a)`, in this way, we are able to merge the following two "identical" Ifs, which is not before. 
> > > public static void test(int a, int b) { // ok, identical ifs, apply split_if > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > } > > public static void test(int a, int b) { // do nothing > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (b == a) { > int_field = 0x42; > } else { > int_field = 42; > } > } Yi Yang has updated the pull request incrementally with one additional commit since the last revision: dont apply Identity for dead bool ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12978/files - new: https://git.openjdk.org/jdk/pull/12978/files/8d54fb4c..f85fac8f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12978&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12978&range=00-01 Stats: 22 lines in 2 files changed: 12 ins; 5 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/12978.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12978/head:pull/12978 PR: https://git.openjdk.org/jdk/pull/12978 From yyang at openjdk.org Mon Mar 13 07:35:23 2023 From: yyang at openjdk.org (Yi Yang) Date: Mon, 13 Mar 2023 07:35:23 GMT Subject: RFR: 8304034: Remove redundant and meaningless comments in opto Message-ID: Please help review this trivial change to remove redundant and meaningless comments in `hotspot/share/opto` directory. They are either 1. Repeat the function name that the function they comment for. 2. Makes no sense, e.g. `//----Idealize----` And I think original CC-style code (`if( test )`,`call( arg )`) can be formatted in one go, instead of formatting the near code when someone touches them. But this may form a big patch, and it confuses code blame, so I left this work until we reach a consensus. Thanks! ------------- Commit messages: - remove useless comments Changes: https://git.openjdk.org/jdk/pull/12995/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304034 Stats: 2135 lines in 97 files changed: 0 ins; 2135 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12995.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12995/head:pull/12995 PR: https://git.openjdk.org/jdk/pull/12995 From duke at openjdk.org Mon Mar 13 07:37:36 2023 From: duke at openjdk.org (Daniel Skantz) Date: Mon, 13 Mar 2023 07:37:36 GMT Subject: Integrated: 8294715: Add IR checks to the reduction vectorization tests In-Reply-To: References: Message-ID: On Tue, 21 Feb 2023 08:08:31 GMT, Daniel Skantz wrote: > We are lifting some loopopts/superword tests to use the IR framework, and add IR annotations to check that vector reductions take place on x86_64. This can be useful to prevent issues such as JDK-8300865. > > Approach: lift the more general tests in loopopts/superword, mainly using matching rules in cpu/x86/x86.ad, but leave tests mostly unchanged otherwise. Some reductions are considered non-profitable (superword.cpp), so we might need to raise sse/avx value pre-conditions from what would be a strict reading of x86.ad (as noted by @eme64). > > Testing: Local testing (x86_64) using UseSSE={2,3,4}, UseAVX={0,1,2,3}. Tested running all jtreg compiler tests. Tier1-tier5 runs to my knowledge never showed any compiler-related regression in other tests as a result from this work. GHA. Validation: all tests fail if we put unreasonable counts for the respective reduction node, such as counts = {IRNode.ADD_REDUCTION_VI, ">= 10000000"}). > > Thanks @robcasloz and @eme64 for advice. 
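For readers unfamiliar with the IR framework syntax mentioned in the validation note above, a minimal sketch of such a reduction test could look like the following. The class name, data shape and the exact precondition (CPU feature rather than an UseAVX flag) are illustrative assumptions, and the jtreg test header is omitted; the real lifted tests live under test/hotspot/jtreg/compiler/loopopts/superword.

    import compiler.lib.ir_framework.*;

    public class SumRedIntExample {
        static final int[] DATA = new int[1024];

        public static void main(String[] args) {
            TestFramework.run();
        }

        @Test
        // Expect at least one vectorized int add-reduction node when AVX2 is available.
        @IR(applyIfCPUFeature = {"avx2", "true"},
            counts = {IRNode.ADD_REDUCTION_VI, "> 0"})
        static int sumReduction() {
            int sum = 0;
            for (int i = 0; i < DATA.length; i++) {
                sum += DATA[i];
            }
            return sum;
        }
    }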
> > Notes: ProdRed_Double does not vectorize (JDK-8300865). SumRed_Long does not vectorize on 32-bit, according to my reading of source, test on GHA and cross-compiled JDK on 32-bit Linux, so removed these platforms from @requires. Lifted the AbsNeg tests too but added no checks, as these are currently not run on x86_64. This pull request has now been integrated. Changeset: d20bde29 Author: Daniel Skantz Committer: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/d20bde29f2c0162ea62b42d0b618566cf5d9678a Stats: 1202 lines in 13 files changed: 517 ins; 579 del; 106 mod 8294715: Add IR checks to the reduction vectorization tests Reviewed-by: rcastanedalo, epeter ------------- PR: https://git.openjdk.org/jdk/pull/12683 From yyang at openjdk.org Mon Mar 13 07:42:59 2023 From: yyang at openjdk.org (Yi Yang) Date: Mon, 13 Mar 2023 07:42:59 GMT Subject: RFR: 8304034: Remove redundant and meaningless comments in opto [v2] In-Reply-To: References: Message-ID: <7IfWb47lN-hT3rvgyDqOTnp8kl5vhlgBbSjNM0VTN10=.311703fa-339a-42c3-b718-316095e1e17e@github.com> > Please help review this trivial change to remove redundant and meaningless comments in `hotspot/share/opto` directory. > > They are either > 1. Repeat the function name that the function they comment for. > 2. Makes no sense, e.g. `//----Idealize----` > > And I think original CC-style code (`if( test )`,`call( arg )`) can be formatted in one go, instead of formatting the near code when someone touches them. But this may form a big patch, and it confuses code blame, so I left this work until we reach a consensus. > > Thanks! Yi Yang has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: 8304034: Remove redundant and meaningless comments in opto ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12995/files - new: https://git.openjdk.org/jdk/pull/12995/files/944d2295..5680a6ce Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12995.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12995/head:pull/12995 PR: https://git.openjdk.org/jdk/pull/12995 From yzhu at openjdk.org Mon Mar 13 08:19:20 2023 From: yzhu at openjdk.org (Yanhong Zhu) Date: Mon, 13 Mar 2023 08:19:20 GMT Subject: RFR: 8303955: RISC-V: Factor out the tmp parameter from copy_memory and copy_memory_v In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 09:12:25 GMT, Feilong Jiang wrote: > The call site of `copy_memory` and `copy_memory_v` always use `t0` as tmp register, so we can factor the tmp parameter out. > > Testing: > > - [x] tier1 tests on Unmatched board (release build with `-XX:-UseRVV`) > - [x] hotspot_tier1 and jdk_tier1 on QEMU (release build with `-XX:+UseRVV`) Looks good. ------------- Marked as reviewed by yzhu (Author). PR: https://git.openjdk.org/jdk/pull/12969 From duke at openjdk.org Mon Mar 13 08:43:32 2023 From: duke at openjdk.org (=?UTF-8?B?VG9tw6HFoQ==?= Zezula) Date: Mon, 13 Mar 2023 08:43:32 GMT Subject: Integrated: JDK-8303678: [JVMCI] Add possibility to convert object JavaConstant to jobject. 
In-Reply-To: <7CCfSjqdge_fL8Ev_oY44xARp28LpOIOwZQjTks8Igg=.61bcfaa2-23fe-4dbd-965b-39b77ebdec5e@github.com> References: <7CCfSjqdge_fL8Ev_oY44xARp28LpOIOwZQjTks8Igg=.61bcfaa2-23fe-4dbd-965b-39b77ebdec5e@github.com> Message-ID: <9HBxkKO02aPOoP962RYCmXhAMnZMiSBC4Jaw9HYucD0=.97a1e7fe-37e1-4d22-ab18-a665de97d5a8@github.com> On Mon, 6 Mar 2023 15:25:36 GMT, Tom?? Zezula wrote: > This pull request adds a `jdk.vm.ci.hotspot.HotSpotJVMCIRuntime#getJObjectValue(HotSpotObjectConstant peerObject)` method, which gets a reference to an object in the peer runtime wrapped by the `jdk.vm.ci.hotspot.IndirectHotSpotObjectConstantImpl`. The reference is returned as a HotSpot heap JNI jobject. This pull request has now been integrated. Changeset: 1148a659 Author: Tomas Zezula Committer: Doug Simon URL: https://git.openjdk.org/jdk/commit/1148a659a89edc6a4f320d578bc0025eae3553fb Stats: 20 lines in 1 file changed: 20 ins; 0 del; 0 mod 8303678: [JVMCI] Add possibility to convert object JavaConstant to jobject. Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/12882 From duke at openjdk.org Mon Mar 13 08:44:33 2023 From: duke at openjdk.org (=?UTF-8?B?VG9tw6HFoQ==?= Zezula) Date: Mon, 13 Mar 2023 08:44:33 GMT Subject: Integrated: JDK-8303646: [JVMCI] Add possibility to lookup ResolvedJavaType from jclass. In-Reply-To: References: Message-ID: On Mon, 6 Mar 2023 12:03:24 GMT, Tom?? Zezula wrote: > This pull request adds a `jdk.vm.ci.hotspot.HotSpotJVMCIRuntime#asResolvedJavaType(long hotspot_jclass_value)` method, which converts a HotSpot heap JNI `hotspot_jclass_value` to a `jdk.vm.ci.meta.ResolvedJavaType`. This pull request has now been integrated. Changeset: 31e1e397 Author: Tomas Zezula Committer: Doug Simon URL: https://git.openjdk.org/jdk/commit/31e1e3975bf20a37a93a138dd651c6f50a80808f Stats: 42 lines in 3 files changed: 42 ins; 0 del; 0 mod 8303646: [JVMCI] Add possibility to lookup ResolvedJavaType from jclass. Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/12878 From epeter at openjdk.org Mon Mar 13 10:16:46 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 13 Mar 2023 10:16:46 GMT Subject: RFR: 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs [v26] In-Reply-To: <_ygbSM3ARi21Y5jfOqEIYEqZe84cwTbQ6vayG9hVYiY=.36ace32c-73ac-452e-a9ac-f9ffaea08931@github.com> References: <_ygbSM3ARi21Y5jfOqEIYEqZe84cwTbQ6vayG9hVYiY=.36ace32c-73ac-452e-a9ac-f9ffaea08931@github.com> Message-ID: <6-Ea0WdYmM96b6eqsNReqIKevZPZsZkWEOSoUyycJpo=.11b237f7-35ca-4519-a605-5a541719c5cb@github.com> On Thu, 9 Mar 2023 21:04:57 GMT, Emanuel Peter wrote: >> **List of important things below** >> >> - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 >> - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 >> - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord >> - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 >> - Explanation of my new `find_dependency`. Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 >> >> **Original RFE description:** >> Cyclic dependencies are not handled correctly in all cases. 
Three examples: >> >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 >> >> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 >> >> And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 >> >> All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. >> >> **Analysis** >> >> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: >> >> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. >> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). >> >> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. >> >> I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. >> **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! >> >> **Solution** >> >> First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. >> >> I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. 
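To make the verification idea described above concrete, here is a self-contained model in plain Java (not the actual C2 code) of checking a single pack with one BFS over a dependence graph; in SuperWord the traversal is additionally restricted to the nodes of the loop block.

    import java.util.*;

    class PackIndependenceModel {
        // deps.get(n) = the nodes that n depends on (memory and data inputs).
        // A pack is acceptable only if no member transitively depends on another member.
        static boolean isMutuallyIndependent(List<Integer> pack, Map<Integer, List<Integer>> deps) {
            Set<Integer> members = new HashSet<>(pack);
            Set<Integer> visited = new HashSet<>();
            Deque<Integer> worklist = new ArrayDeque<>();
            for (int m : pack) {
                worklist.addAll(deps.getOrDefault(m, List.of())); // seed with the inputs of every member
            }
            while (!worklist.isEmpty()) {
                int n = worklist.poll();
                if (!visited.add(n)) {
                    continue;                 // already explored
                }
                if (members.contains(n)) {
                    return false;             // one member depends on another member of the same pack
                }
                worklist.addAll(deps.getOrDefault(n, List.of()));
            }
            return true;
        }
    }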
If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. >> >> Another change I have made: >> Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. >> >> **Testing** >> >> I added a few more regression tests, and am running tier1-3, plus some stress testing. >> >> However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. >> **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. >> >> **Discussion / Future Work** >> >> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 40 commits: > > - Merge master after NULL -> nullptr conversion > - Fixed wording from last commit > - A little renaming and improved comments > - resolve merge conflict after Roland's fix > - TestDependencyOffsets.java: add vanilla run > - TestDependencyOffsets.java: parallelize it + various AVX settings > - TestOptionVectorizeIR.java: removed PopulateIndex IR rule - fails on x86 32bit - see Matcher::match_rule_supported > - Merge branch 'master' into JDK-8298935 > - Reworked TestDependencyOffsets.java > - remove negative IR rules for TestOptionVectorizeIR.java > - ... and 30 more: https://git.openjdk.org/jdk/compare/5726d31e...731cc7b5 https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 > So is there a **4th Bug** lurking here? @vnkozlov I now found an example that reveals this **Bug 4**. I want to fix it in a separate Bug [JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042). ------------- PR: https://git.openjdk.org/jdk/pull/12350 From aph at openjdk.org Mon Mar 13 10:32:35 2023 From: aph at openjdk.org (Andrew Haley) Date: Mon, 13 Mar 2023 10:32:35 GMT Subject: RFR: 8302906: AArch64: Add SVE backend support for vector unsigned comparison [v3] In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 07:02:27 GMT, changpeng1997 wrote: >> This patch implements unsigned vector comparison on SVE. >> >> 1: Test: >> All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. >> >> 2: Performance: >> (1): Benchmark: >> As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. 
for ByteVector unsigned comparison: >> >> >> @Benchmark >> public void byteVectorUnsignedCompare() { >> for (int j = 0; j < 200; j++) { >> for (int i = 0; i < bspecies.length(); i++) { >> ByteVector av = ByteVector.fromArray(bspecies, ba, i); >> ByteVector ca = ByteVector.fromArray(bspecies, bb, i); >> av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); >> } >> } >> } >> >> >> (2): Performance data >> >> Before: >> >> >> Benchmark Score(op/ms) Error >> ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 >> ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 >> IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 >> LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 >> >> >> After: >> >> >> Benchmark Score(op/ms) Error >> ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 >> ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 >> IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 >> LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 >> >> >> [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector >> [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi >> [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 >> [4] https://bugs.openjdk.org/browse/JDK-8282850 >> [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae > > changpeng1997 has updated the pull request incrementally with one additional commit since the last revision: > > Refactor part of code in C2 assembler and remove some switch-case stmts. src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3218: > 3216: f(1, 21), rf(Vm, 16), f(0b111001, 15, 10), rf(Vn, 5), rf(Vd, 0); > 3217: } > 3218: This looks OK, but it's in the wrong place in the file. Look at C4.1 A64 instruction set encoding. These instructions are in the "Advanced SIMD three same" group, so they must appear in assembler_aarch64.hpp in the "Advanced SIMD three same" section. This is the "AdvSIMD two-reg misc" section. ------------- PR: https://git.openjdk.org/jdk/pull/12725 From epeter at openjdk.org Mon Mar 13 10:37:32 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 13 Mar 2023 10:37:32 GMT Subject: RFR: 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs [v27] In-Reply-To: References: Message-ID: <7BaWanF45N91xNBIIXvX0yXhxEAPFfKK18G36oWF6LI=.16fb8405-9e5e-4789-8671-8f2e465cdde5@github.com> > **List of important things below** > > - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 > - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 > - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord > - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 > - Explanation of my new `find_dependency`. Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 > > **Original RFE description:** > Cyclic dependencies are not handled correctly in all cases. 
Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. > > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. 
If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 41 commits: - merge master: resolved conflict in test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java - Merge master after NULL -> nullptr conversion - Fixed wording from last commit - A little renaming and improved comments - resolve merge conflict after Roland's fix - TestDependencyOffsets.java: add vanilla run - TestDependencyOffsets.java: parallelize it + various AVX settings - TestOptionVectorizeIR.java: removed PopulateIndex IR rule - fails on x86 32bit - see Matcher::match_rule_supported - Merge branch 'master' into JDK-8298935 - Reworked TestDependencyOffsets.java - ... and 31 more: https://git.openjdk.org/jdk/compare/25e7ac22...ff0850e6 ------------- Changes: https://git.openjdk.org/jdk/pull/12350/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12350&range=26 Stats: 12924 lines in 7 files changed: 12863 ins; 46 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/12350.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350 PR: https://git.openjdk.org/jdk/pull/12350 From redestad at openjdk.org Mon Mar 13 11:13:56 2023 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 13 Mar 2023 11:13:56 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v4] In-Reply-To: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> References: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> Message-ID: <7vM79TpX4-EX1AMk7FEuChwfWdjbOeQ0H_8g26epcXc=.02788d6d-9e38-4205-bb88-8d15e7568926@github.com> On Fri, 10 Mar 2023 01:10:03 GMT, Jasmine K. wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. 
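The two rewrites quoted above for the C1 == C2 case can be sanity-checked over all shift amounts with a few lines of plain Java (this is only an illustration of the identities, not code from the patch):

    import java.util.Random;

    public class ShiftMaskIdentityCheck {
        public static void main(String[] args) {
            Random r = new Random(42);
            for (int i = 0; i < 1_000_000; i++) {
                int x = r.nextInt();
                int y = r.nextInt();
                for (int c = 0; c < 32; c++) {
                    // (x >> C) << C  ==  x & (-1 << C)
                    if (((x >> c) << c) != (x & (-1 << c))) throw new AssertionError();
                    // ((x >> C) & Y) << C  ==  x & (Y << C)
                    if ((((x >> c) & y) << c) != (x & (y << c))) throw new AssertionError();
                }
            }
            System.out.println("identities hold for all sampled values");
        }
    }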
In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Update full name Testing looks good ------------- PR: https://git.openjdk.org/jdk/pull/12734 From duke at openjdk.org Mon Mar 13 11:13:59 2023 From: duke at openjdk.org (Jasmine K.) Date: Mon, 13 Mar 2023 11:13:59 GMT Subject: Integrated: 8303238: Create generalizations for existing LShift ideal transforms In-Reply-To: References: Message-ID: On Thu, 23 Feb 2023 20:28:31 GMT, Jasmine K. wrote: > Hello, > I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% > LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% > LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% > LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% > LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% > LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% > LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 
14.983 ns/op + 0.95% > > > In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. > > Testing: GHA, tier1 local, and performance testing > > Thanks, > Jasmine K This pull request has now been integrated. Changeset: 8e41bf22 Author: Jasmine K <25208576+SuperCoder7979 at users.noreply.github.com> Committer: Claes Redestad URL: https://git.openjdk.org/jdk/commit/8e41bf222f4adce0bfaee7d464962d5ae22e3b3b Stats: 425 lines in 4 files changed: 396 ins; 0 del; 29 mod 8303238: Create generalizations for existing LShift ideal transforms Reviewed-by: redestad, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12734 From duke at openjdk.org Mon Mar 13 13:05:29 2023 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 13 Mar 2023 13:05:29 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 22:55:19 GMT, Vladimir Kozlov wrote: > Did you look on how many times we flush ICache during adapters generation? It has most numerous cases when I looked on it: "CodeCache::commit() is also used for adapters. But adapters uses RuntimeBlob which calls CodeBuffer::copy_code_to()." @vnkozlov the ICache flushing was called 1596 times during adapters generation with C1. You're right, these are by far the most calls and the flush calls are also performed twice in these cases, once in `CodeBuffer::copy_code_to()` and once in `CodeCache::commit()` (I've missed it). > I thought we would remove flush from CodeCache::commit() and not from copy_code_to(). I thought it would make more sense to keep the flush in `CodeCache::commit()` since it was generally the last call made but, in light of what you're pointing out, it would definitely make more sense to remove it from `CodeCache::commit()` and only leave it in `copy_code_to`. This also halves the number of flushing coming from adapters. ------------- PR: https://git.openjdk.org/jdk/pull/12877 From duke at openjdk.org Mon Mar 13 13:52:59 2023 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 13 Mar 2023 13:52:59 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v2] In-Reply-To: References: Message-ID: > It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. > There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). > > This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache at `Compilation::emit_code_epilog` and when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to` if there is a call to `CodeCache::commit` later on. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation respectively. > > This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3568 to 3487 on C1 (2.27% improvement) and from 3499 to 3415 on C2 (2.40% improvement). 
Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: JDK-8303154: remove flush in CodeCache::commit() instead of CodeBuffer::copy_code_to() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12877/files - new: https://git.openjdk.org/jdk/pull/12877/files/85eb7914..2e119cb1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12877&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12877&range=00-01 Stats: 15 lines in 5 files changed: 0 ins; 7 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/12877.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12877/head:pull/12877 PR: https://git.openjdk.org/jdk/pull/12877 From kbarrett at openjdk.org Mon Mar 13 14:29:32 2023 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 13 Mar 2023 14:29:32 GMT Subject: RFR: 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers [v3] In-Reply-To: References: <2dvxIq2gzTJzz-3HdADIWO9vMXhzTBEvjNODO5GUL70=.3a6366f2-f65a-4c6a-9667-615a483a26d4@github.com> Message-ID: On Thu, 23 Feb 2023 10:48:22 GMT, Christian Hagedorn wrote: >> We need @kbarrett opinion on C++ code for this case. >> I prefer to use already defined functions if they work to avoid duplication. > > @kimbarrett > > Looks like we've initially pinged the wrong Kim Barrett :-) C++14 5.8/3 In the description of "E1 >> E2" it says "If E1 has a signed type and a negative value, the resulting value is implementation-defined." However, C++20 7.6.7/3 further defines integral arithmetic, as part of requiring two's-complement behavior. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0907r3.html https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1236r1.html The corresponding C++20 text is "Right-shift on signed integral types is an arithmetic right shift, which performs sign-extension." As discussed in the two's complement proposal, all known modern C++ compilers already behave that way. And it is unlikely any would go off and do something different now, with C++20 tightening things up. So I think relying on sign extension by right shift is fine. ------------- PR: https://git.openjdk.org/jdk/pull/11907 From kbarrett at openjdk.org Mon Mar 13 14:35:57 2023 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 13 Mar 2023 14:35:57 GMT Subject: RFR: 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers [v3] In-Reply-To: References: <2dvxIq2gzTJzz-3HdADIWO9vMXhzTBEvjNODO5GUL70=.3a6366f2-f65a-4c6a-9667-615a483a26d4@github.com> Message-ID: <-kI7Dqep7RYHo4Snn5Nt_SadYZk16uT7Z14m_8LLK5E=.74cd4f40-29eb-4cc9-9cbf-f5fb1b35d1b3@github.com> On Mon, 13 Mar 2023 14:26:29 GMT, Kim Barrett wrote: >> @kimbarrett >> >> Looks like we've initially pinged the wrong Kim Barrett :-) > > C++14 5.8/3 In the description of "E1 >> E2" it says "If E1 has a signed type > and a negative value, the resulting value is implementation-defined." > > However, C++20 7.6.7/3 further defines integral arithmetic, as part of > requiring two's-complement behavior. > https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0907r3.html > https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1236r1.html > The corresponding C++20 text is "Right-shift on signed integral types is an > arithmetic right shift, which performs sign-extension." > > As discussed in the two's complement proposal, all known modern C++ compilers > already behave that way. 
And it is unlikely any would go off and do something > different now, with C++20 tightening things up. > > So I think relying on sign extension by right shift is fine. That comment quoted earlier from globalDefinitions.hpp could be expanded to include the above analysis. ------------- PR: https://git.openjdk.org/jdk/pull/11907 From chagedorn at openjdk.org Mon Mar 13 14:55:17 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 13 Mar 2023 14:55:17 GMT Subject: RFR: 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers [v4] In-Reply-To: References: Message-ID: <7KjN7vkpQApuoEVSijdD_2Txycr6Xyl600B5SXmrsis=.7a09abb7-4076-43d2-a05f-4378cf5537ee@github.com> > The current logic in `MulLNode::mul_ring()` casts all `jlong` values of the involved type ranges of a multiplication to `double` in order to catch overflows when multiplying the two type ranges. This works fine for values in the `jlong` range that are not larger than 253 or lower than -253. For numbers outside that range, we could experience precision errors because these numbers cannot be represented precisely due to the nature of how doubles are represented with a 52 bit mantissa. For example, the number 253 and 253 + 1 both have the same `double` representation of 253. > > In `MulLNode::mul_ring()`, we could do a multiplication with a `lo` or `hi` value of a type that is larger than 253 (or smaller than -253). In this case, we might get a different result compared to doing the same multiplication with `jlong` values (even though there is no overflow/underflow). As a result, we return `TypeLong::LONG` (bottom type) and missed an optimization opportunity (e.g. folding an `If` when using the `MulL` node in the condition etc.). > > This was caught by the new verification code added to CCP in [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197) which checks that after CCP, we should not get a different type anymore when calling `Value()` on a node. In the found fuzzer testcase, we run into the precision problem described above for a `MulL` node and set the type to bottom during CCP (even though there was no actual overflow). Since the type is bottom, we do not re-add the node to the CCP worklist because the premise is that types only go from top to bottom during CCP. Afterwards, an input type of the `MulL` node is updated again in such a way that the previously imprecise `double` multiplication in `mul_ring()` is now exact (by coincidence). We then hit the "missed optimization opportunity" assert added by JDK-8257197. > > To fix this problem, I suggest to switch from a `jlong` - > `double` multiplication overflow check to an overflow check without casting. I've used the idea that `x = a * b` is the same as `b = x / a` (for `a != 0` and `!(a = -1 && b = MIN_VALUE)`) which is also applied in `Math.multiplyExact()`: https://github.com/openjdk/jdk/blob/66db0bb6a15310e4e60ff1e33d40e03c52c4eca8/src/java.base/share/classes/java/lang/Math.java#L1022-L1036 > > The code of `MulLNode::mul_ring()` is almost identical to `MulINode::mul_ring()`. I've refactored that into a template class in order to share the code and simplified the overflow checking by using `MIN/MAX4` instead of using nested `if/else` statements. > > Thanks, > Christian Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision: - Use java_shift_right instead of portable version with updated comment from Kim - Merge branch 'master' into JDK-8299546 - Merge branch 'master' into JDK-8299546 - Change algorithm to handle overflows/underflows if the cross products have the same number of overflows/underflows as suggested by Quan - Merge branch 'master' into JDK-8299546 - review - 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11907/files - new: https://git.openjdk.org/jdk/pull/11907/files/0453789f..dc6896e9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11907&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11907&range=02-03 Stats: 377672 lines in 4573 files changed: 208336 ins; 120339 del; 48997 mod Patch: https://git.openjdk.org/jdk/pull/11907.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11907/head:pull/11907 PR: https://git.openjdk.org/jdk/pull/11907 From chagedorn at openjdk.org Mon Mar 13 14:55:17 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 13 Mar 2023 14:55:17 GMT Subject: RFR: 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers [v3] In-Reply-To: <-kI7Dqep7RYHo4Snn5Nt_SadYZk16uT7Z14m_8LLK5E=.74cd4f40-29eb-4cc9-9cbf-f5fb1b35d1b3@github.com> References: <2dvxIq2gzTJzz-3HdADIWO9vMXhzTBEvjNODO5GUL70=.3a6366f2-f65a-4c6a-9667-615a483a26d4@github.com> <-kI7Dqep7RYHo4Snn5Nt_SadYZk16uT7Z14m_8LLK5E=.74cd4f40-29eb-4cc9-9cbf-f5fb1b35d1b3@github.com> Message-ID: On Mon, 13 Mar 2023 14:33:15 GMT, Kim Barrett wrote: >> C++14 5.8/3 In the description of "E1 >> E2" it says "If E1 has a signed type >> and a negative value, the resulting value is implementation-defined." >> >> However, C++20 7.6.7/3 further defines integral arithmetic, as part of >> requiring two's-complement behavior. >> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0907r3.html >> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1236r1.html >> The corresponding C++20 text is "Right-shift on signed integral types is an >> arithmetic right shift, which performs sign-extension." >> >> As discussed in the two's complement proposal, all known modern C++ compilers >> already behave that way. And it is unlikely any would go off and do something >> different now, with C++20 tightening things up. >> >> So I think relying on sign extension by right shift is fine. > > That comment quoted earlier from globalDefinitions.hpp could be expanded to include the above analysis. Thanks a lot Kim for your input and the detailed comments! I've included it as a comment at the `java_shift_right()` method and updated the code to directly use `java_shift_right()` instead of `shift_right_arithmetic()`. ------------- PR: https://git.openjdk.org/jdk/pull/11907 From duke at openjdk.org Mon Mar 13 15:37:45 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Mon, 13 Mar 2023 15:37:45 GMT Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. [v2] In-Reply-To: References: Message-ID: > Changing RuntimeException to SkippedException when TieredStopAtLevel < 4. > > The main part of this problem was done in [JDK-8226795](https://bugs.openjdk.org/browse/JDK-8226795) but for some reason, the exception was not changed. Ilya Korennoy has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'openjdk:master' into 8299226 - 8299226: Throw SkippedException instead of RuntimeException in case of TieredStopAtLevel < 4 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12981/files - new: https://git.openjdk.org/jdk/pull/12981/files/9663af7b..bff4593a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12981&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12981&range=00-01 Stats: 3834 lines in 117 files changed: 1528 ins; 1714 del; 592 mod Patch: https://git.openjdk.org/jdk/pull/12981.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12981/head:pull/12981 PR: https://git.openjdk.org/jdk/pull/12981 From duke at openjdk.org Mon Mar 13 15:39:11 2023 From: duke at openjdk.org (Damon Fenacci) Date: Mon, 13 Mar 2023 15:39:11 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v2] In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 22:55:19 GMT, Vladimir Kozlov wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8303154: remove flush in CodeCache::commit() instead of CodeBuffer::copy_code_to() > > Did you look on how many times we flush ICache during adapters generation? > It has most numerous cases when I looked on it: > "CodeCache::commit() is also used for adapters. But adapters uses RuntimeBlob which calls CodeBuffer::copy_code_to()." > > I thought we would remove flush from CodeCache::commit() and not from copy_code_to(). @vnkozlov @TobiHartmann I pushed the changes if you want to have a look at them again. Thanks a lot! ------------- PR: https://git.openjdk.org/jdk/pull/12877 From chagedorn at openjdk.org Mon Mar 13 15:43:05 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 13 Mar 2023 15:43:05 GMT Subject: RFR: 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers [v5] In-Reply-To: References: Message-ID: <1oEtfGK7rvziWpDy2fuRJvQS5mlY6wrybfzpPdo5cDs=.490c2c57-1aea-4a8a-9f51-258afd7eba1d@github.com> > The current logic in `MulLNode::mul_ring()` casts all `jlong` values of the involved type ranges of a multiplication to `double` in order to catch overflows when multiplying the two type ranges. This works fine for values in the `jlong` range that are not larger than 253 or lower than -253. For numbers outside that range, we could experience precision errors because these numbers cannot be represented precisely due to the nature of how doubles are represented with a 52 bit mantissa. For example, the number 253 and 253 + 1 both have the same `double` representation of 253. > > In `MulLNode::mul_ring()`, we could do a multiplication with a `lo` or `hi` value of a type that is larger than 253 (or smaller than -253). In this case, we might get a different result compared to doing the same multiplication with `jlong` values (even though there is no overflow/underflow). As a result, we return `TypeLong::LONG` (bottom type) and missed an optimization opportunity (e.g. folding an `If` when using the `MulL` node in the condition etc.). > > This was caught by the new verification code added to CCP in [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197) which checks that after CCP, we should not get a different type anymore when calling `Value()` on a node. 
In the found fuzzer testcase, we run into the precision problem described above for a `MulL` node and set the type to bottom during CCP (even though there was no actual overflow). Since the type is bottom, we do not re-add the node to the CCP worklist because the premise is that types only go from top to bottom during CCP. Afterwards, an input type of the `MulL` node is updated again in such a way that the previously imprecise `double` multiplication in `mul_ring()` is now exact (by coincidence). We then hit the "missed optimization opportunity" assert added by JDK-8257197. > > To fix this problem, I suggest to switch from a `jlong` - > `double` multiplication overflow check to an overflow check without casting. I've used the idea that `x = a * b` is the same as `b = x / a` (for `a != 0` and `!(a = -1 && b = MIN_VALUE)`) which is also applied in `Math.multiplyExact()`: https://github.com/openjdk/jdk/blob/66db0bb6a15310e4e60ff1e33d40e03c52c4eca8/src/java.base/share/classes/java/lang/Math.java#L1022-L1036 > > The code of `MulLNode::mul_ring()` is almost identical to `MulINode::mul_ring()`. I've refactored that into a template class in order to share the code and simplified the overflow checking by using `MIN/MAX4` instead of using nested `if/else` statements. > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Fix build failures on Mac ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11907/files - new: https://git.openjdk.org/jdk/pull/11907/files/dc6896e9..391260da Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11907&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11907&range=03-04 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/11907.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11907/head:pull/11907 PR: https://git.openjdk.org/jdk/pull/11907 From kvn at openjdk.org Mon Mar 13 16:38:13 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 13 Mar 2023 16:38:13 GMT Subject: RFR: 8304034: Remove redundant and meaningless comments in opto [v2] In-Reply-To: <7IfWb47lN-hT3rvgyDqOTnp8kl5vhlgBbSjNM0VTN10=.311703fa-339a-42c3-b718-316095e1e17e@github.com> References: <7IfWb47lN-hT3rvgyDqOTnp8kl5vhlgBbSjNM0VTN10=.311703fa-339a-42c3-b718-316095e1e17e@github.com> Message-ID: On Mon, 13 Mar 2023 07:42:59 GMT, Yi Yang wrote: >> Please help review this trivial change to remove redundant and meaningless comments in `hotspot/share/opto` directory. >> >> They are either >> 1. Repeat the function name that the function they comment for. >> 2. Makes no sense, e.g. `//----Idealize----` >> >> And I think original CC-style code (`if( test )`,`call( arg )`) can be formatted in one go, instead of formatting the near code when someone touches them. But this may form a big patch, and it confuses code blame, so I left this work until we reach a consensus. >> >> Thanks! > > Yi Yang has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > 8304034: Remove redundant and meaningless comments in opto I agree that comments which duplicate method names are useless. But some comment have meaning. Consider cleanup multiply sequential empty lines too. 
src/hotspot/share/opto/loopTransform.cpp line 61: > 59: > 60: > 61: May be look to remove multiply sequential empty lines too. One empty line as code separator is enough. src/hotspot/share/opto/loopopts.cpp line 1840: > 1838: // C L O N E A L O O P B O D Y > 1839: // > 1840: This can be removed too. We have this comment before `clone_loop()` method. src/hotspot/share/opto/macro.cpp line 1134: > 1132: } > 1133: > 1134: //============================================================================= I would leave this one src/hotspot/share/opto/macro.cpp line 1173: > 1171: // oop flavor. > 1172: // > 1173: //============================================================================= this one src/hotspot/share/opto/macro.cpp line 1179: > 1177: // trigger exceptions go the slow route. Also, it must be small enough so > 1178: // that heap_top + size_in_bytes does not wrap around the 4Gig limit. > 1179: //=============================================================================j// and this one. src/hotspot/share/opto/parse1.cpp line 101: > 99: #endif > 100: > 101: //------------------------------ON STACK REPLACEMENT--------------------------- Leave this. src/hotspot/share/opto/phaseX.cpp line 46: > 44: //============================================================================= > 45: #define NODE_HASH_MINIMUM_SIZE 255 > 46: //------------------------------NodeHash--------------------------------------- Add empty line to separate `#define` from following code. src/hotspot/share/opto/phaseX.cpp line 1925: > 1923: uint PhaseCCP::_total_constants = 0; > 1924: #endif > 1925: //------------------------------PhaseCCP--------------------------------------- Add empty line. src/hotspot/share/opto/regmask.hpp line 35: > 33: class LRG; > 34: > 35: //-------------Non-zero bit search methods used by RegMask--------------------- Leave this. src/hotspot/share/opto/runtime.cpp line 195: > 193: //============================================================================= > 194: // Opto compiler runtime routines > 195: //============================================================================= Leave these. src/hotspot/share/opto/runtime.cpp line 706: > 704: } > 705: > 706: //-------------- currentTimeMillis, currentTimeNanos, etc Leave this. src/hotspot/share/opto/runtime.cpp line 1309: > 1307: } > 1308: > 1309: //------------- Interpreter state access for on stack replacement Leave this. src/hotspot/share/opto/superword.cpp line 46: > 44: // > 45: // S U P E R W O R D T R A N S F O R M > 46: //============================================================================= Leave this. src/hotspot/share/opto/vectornode.hpp line 133: > 131: }; > 132: > 133: //===========================Vector=ALU=Operations============================= Leave it. src/hotspot/share/opto/vectornode.hpp line 825: > 823: }; > 824: > 825: //================================= M E M O R Y =============================== Leave it. src/hotspot/share/opto/vectornode.hpp line 1114: > 1112: }; > 1113: > 1114: //=========================Promote_Scalar_to_Vector============================ Leave it src/hotspot/share/opto/vectornode.hpp line 1171: > 1169: }; > 1170: > 1171: //========================Pack_Scalars_into_a_Vector=========================== Leave it src/hotspot/share/opto/vectornode.hpp line 1267: > 1265: }; > 1266: > 1267: //========================Extract_Scalar_from_Vector=========================== Leave it. 
------------- PR: https://git.openjdk.org/jdk/pull/12995 From jvernee at openjdk.org Mon Mar 13 16:59:59 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Mon, 13 Mar 2023 16:59:59 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v3] In-Reply-To: References: Message-ID: On Sat, 11 Mar 2023 00:15:10 GMT, Vladimir Kozlov wrote: >> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: >> >> RISCV changes > > Good. @vnkozlov Does this need another reviewer? ------------- PR: https://git.openjdk.org/jdk/pull/12908 From duke at openjdk.org Mon Mar 13 17:00:41 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Mon, 13 Mar 2023 17:00:41 GMT Subject: RFR: 8293324: ciField.hpp has two methods to return field's offset Message-ID: Small refactoring of ciField.hpp method `offset()` removed and `offset_in_bytes()` used instead. Test: tier1 linux-x86_64 ------------- Commit messages: - 8293324: Remove method offset from ciField.hpp and use offset_in_bytes instead Changes: https://git.openjdk.org/jdk/pull/13003/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13003&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8293324 Stats: 20 lines in 10 files changed: 0 ins; 5 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/13003.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13003/head:pull/13003 PR: https://git.openjdk.org/jdk/pull/13003 From kvn at openjdk.org Mon Mar 13 17:08:43 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 13 Mar 2023 17:08:43 GMT Subject: RFR: 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs [v25] In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 16:49:03 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixed wording from last commit > > There was no big meaning in my question "Does not matter if they are in an other pack or not?" > As you explained we go through memory and data inputs. In simple case they would be in an other pack (since we looking only inside block). But in `_do_vector_loop` case (and may be other cases) some packs could be eliminate leaving nodes not in packs. But it does not hinder the search for dependence. That is what I want to say and ask for confirmation. > @vnkozlov I now found an example that reveals this **Bug 4**. I want to fix it in a separate Bug [JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042). Agree. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From kvn at openjdk.org Mon Mar 13 17:35:33 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 13 Mar 2023 17:35:33 GMT Subject: RFR: 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers [v5] In-Reply-To: <1oEtfGK7rvziWpDy2fuRJvQS5mlY6wrybfzpPdo5cDs=.490c2c57-1aea-4a8a-9f51-258afd7eba1d@github.com> References: <1oEtfGK7rvziWpDy2fuRJvQS5mlY6wrybfzpPdo5cDs=.490c2c57-1aea-4a8a-9f51-258afd7eba1d@github.com> Message-ID: On Mon, 13 Mar 2023 15:43:05 GMT, Christian Hagedorn wrote: >> The current logic in `MulLNode::mul_ring()` casts all `jlong` values of the involved type ranges of a multiplication to `double` in order to catch overflows when multiplying the two type ranges. This works fine for values in the `jlong` range that are not larger than 253 or lower than -253. 
For numbers outside that range, we could experience precision errors because these numbers cannot be represented precisely due to the nature of how doubles are represented with a 52 bit mantissa. For example, the numbers 2^53 and 2^53 + 1 both have the same `double` representation of 2^53. >> >> In `MulLNode::mul_ring()`, we could do a multiplication with a `lo` or `hi` value of a type that is larger than 2^53 (or smaller than -2^53). In this case, we might get a different result compared to doing the same multiplication with `jlong` values (even though there is no overflow/underflow). As a result, we return `TypeLong::LONG` (bottom type) and missed an optimization opportunity (e.g. folding an `If` when using the `MulL` node in the condition etc.). >> >> This was caught by the new verification code added to CCP in [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197) which checks that after CCP, we should not get a different type anymore when calling `Value()` on a node. In the found fuzzer testcase, we run into the precision problem described above for a `MulL` node and set the type to bottom during CCP (even though there was no actual overflow). Since the type is bottom, we do not re-add the node to the CCP worklist because the premise is that types only go from top to bottom during CCP. Afterwards, an input type of the `MulL` node is updated again in such a way that the previously imprecise `double` multiplication in `mul_ring()` is now exact (by coincidence). We then hit the "missed optimization opportunity" assert added by JDK-8257197. >> >> To fix this problem, I suggest to switch from a `jlong` -> `double` multiplication overflow check to an overflow check without casting. I've used the idea that `x = a * b` is the same as `b = x / a` (for `a != 0` and `!(a == -1 && b == MIN_VALUE)`) which is also applied in `Math.multiplyExact()`: https://github.com/openjdk/jdk/blob/66db0bb6a15310e4e60ff1e33d40e03c52c4eca8/src/java.base/share/classes/java/lang/Math.java#L1022-L1036 >> >> The code of `MulLNode::mul_ring()` is almost identical to `MulINode::mul_ring()`. I've refactored that into a template class in order to share the code and simplified the overflow checking by using `MIN/MAX4` instead of using nested `if/else` statements. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix build failures on Mac Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11907 From kvn at openjdk.org Mon Mar 13 17:39:30 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 13 Mar 2023 17:39:30 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v2] In-Reply-To: References: Message-ID: <5hCuWCzGEIWdOmUsak9PyKNPNJgfRW6oZIxDdDhc-WQ=.7793a434-58cf-4fb7-98fa-0ff6c59ea36a@github.com> On Mon, 13 Mar 2023 13:52:59 GMT, Damon Fenacci wrote: >> It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. >> There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). >> >> This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks.
This is done by avoiding calls to flush the ICache at `Compilation::emit_code_epilog` and when calling `CodeCache::commit` as flushing is done anyway when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to`. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation. Additionally this halves the number of flushes during adapters generation (lots of calls). >> >> This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3569 to 2756 on C1 (22.8% improvement) and from 3572 to 2685 on C2 (24.1% improvement). > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8303154: remove flush in CodeCache::commit() instead of CodeBuffer::copy_code_to() Good. What are new numbers of of calls to flush the ICache? ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/12877 From kvn at openjdk.org Mon Mar 13 17:51:11 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 13 Mar 2023 17:51:11 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v2] In-Reply-To: References: Message-ID: <9k5Kq1Lq7wo_tSB4xVQ3_lHNPymo-1WyHYVmcWrs0mg=.0f7e9a62-7e5a-44e5-ac26-9730112378aa@github.com> On Mon, 13 Mar 2023 15:35:36 GMT, Damon Fenacci wrote: >> Did you look on how many times we flush ICache during adapters generation? >> It has most numerous cases when I looked on it: >> "CodeCache::commit() is also used for adapters. But adapters uses RuntimeBlob which calls CodeBuffer::copy_code_to()." >> >> I thought we would remove flush from CodeCache::commit() and not from copy_code_to(). > > @vnkozlov @TobiHartmann I pushed the changes if you want to have a look at them again. Thanks a lot! @dafedafe I look on stack traces you collected. Please look on this: 732 Stack: ICache::invalidate_range(unsigned char *, int) icache_bsd_aarch64.hpp:41 AbstractAssembler::flush() assembler.cpp:110 SharedRuntime::generate_i2c2i_adapters(MacroAssembler *, int, int, const BasicType *, const VMRegPair *, AdapterFingerPrint *) sharedRuntime_aarch64.cpp:794 AdapterHandlerLibrary::create_adapter(AdapterBlob *&, int, BasicType *, bool) sharedRuntime.cpp:2970 `generate_i2c2i_adapters` is called for temporary buffer so we don't need to flush it: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/sharedRuntime.cpp#L2870 ------------- PR: https://git.openjdk.org/jdk/pull/12877 From kvn at openjdk.org Mon Mar 13 18:05:28 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 13 Mar 2023 18:05:28 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v3] In-Reply-To: References: Message-ID: On Sat, 11 Mar 2023 00:15:10 GMT, Vladimir Kozlov wrote: >> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: >> >> RISCV changes > > Good. > @vnkozlov Does this need another reviewer? Yes, it is not trivial ------------- PR: https://git.openjdk.org/jdk/pull/12908 From coleenp at openjdk.org Mon Mar 13 20:30:02 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 13 Mar 2023 20:30:02 GMT Subject: RFR: 8304059: Use InstanceKlass in dependencies Message-ID: Please review this small change to eliminate InstanceKlass::cast() calls for things that should be already InstanceKlass. Tested with tier1-4. 
------------- Commit messages: - 8304059: Use InstanceKlass in dependencies Changes: https://git.openjdk.org/jdk/pull/13005/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13005&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304059 Stats: 27 lines in 4 files changed: 0 ins; 1 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/13005.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13005/head:pull/13005 PR: https://git.openjdk.org/jdk/pull/13005 From vlivanov at openjdk.org Mon Mar 13 20:39:26 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 13 Mar 2023 20:39:26 GMT Subject: RFR: 8304059: Use InstanceKlass in dependencies In-Reply-To: References: Message-ID: On Mon, 13 Mar 2023 20:23:47 GMT, Coleen Phillimore wrote: > Please review this small change to eliminate InstanceKlass::cast() calls for things that should be already InstanceKlass. > Tested with tier1-4. Looks good. Thanks for cleaning it up! ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/13005 From kvn at openjdk.org Mon Mar 13 21:18:58 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 13 Mar 2023 21:18:58 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v3] In-Reply-To: References: Message-ID: > Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. > We have *product* VM flags for most intrinsics and set them in VM based on HW support. > But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. > Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. > > I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). > > Additional fixes: > Fixed Interpreter to skip intrinsics if they are disabled with flag. > Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. > Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. > Added missing `native` mark to `_currentThread`. > Removed unused `AbstractInterpreter::in_native_entry()`. > Cleanup C2 intrinsic checks code. > > Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. Vladimir Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: - merged master and resolved condlicts - Address comments - 8303415: Add VM_Version::is_intrinsic_supported(id) ------------- Changes: https://git.openjdk.org/jdk/pull/12858/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12858&range=02 Stats: 345 lines in 25 files changed: 211 ins; 93 del; 41 mod Patch: https://git.openjdk.org/jdk/pull/12858.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12858/head:pull/12858 PR: https://git.openjdk.org/jdk/pull/12858 From coleenp at openjdk.org Mon Mar 13 21:51:31 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 13 Mar 2023 21:51:31 GMT Subject: RFR: 8304059: Use InstanceKlass in dependencies In-Reply-To: References: Message-ID: On Mon, 13 Mar 2023 20:23:47 GMT, Coleen Phillimore wrote: > Please review this small change to eliminate InstanceKlass::cast() calls for things that should be already InstanceKlass. 
> Tested with tier1-4. Thanks Vladimir. ------------- PR: https://git.openjdk.org/jdk/pull/13005 From fjiang at openjdk.org Tue Mar 14 00:53:00 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 14 Mar 2023 00:53:00 GMT Subject: RFR: 8303955: RISC-V: Factor out the tmp parameter from copy_memory and copy_memory_v In-Reply-To: <8HOCjcd_jg5ToRgvOQ9pWS767BatJxZqgBBTqx-jYN4=.b8f737dc-956a-4a31-b03c-f1e895ea4e04@github.com> References: <8HOCjcd_jg5ToRgvOQ9pWS767BatJxZqgBBTqx-jYN4=.b8f737dc-956a-4a31-b03c-f1e895ea4e04@github.com> Message-ID: On Mon, 13 Mar 2023 01:12:17 GMT, Fei Yang wrote: >> The call site of `copy_memory` and `copy_memory_v` always use `t0` as tmp register, so we can factor the tmp parameter out. >> >> Testing: >> >> - [x] tier1 tests on Unmatched board (release build with `-XX:-UseRVV`) >> - [x] hotspot_tier1 and jdk_tier1 on QEMU (release build with `-XX:+UseRVV`) > > Looks fine. Thanks. @RealFYang @yhzhu20 -- Thanks for the review! ------------- PR: https://git.openjdk.org/jdk/pull/12969 From fjiang at openjdk.org Tue Mar 14 00:58:52 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 14 Mar 2023 00:58:52 GMT Subject: Integrated: 8303955: RISC-V: Factor out the tmp parameter from copy_memory and copy_memory_v In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 09:12:25 GMT, Feilong Jiang wrote: > The call site of `copy_memory` and `copy_memory_v` always use `t0` as tmp register, so we can factor the tmp parameter out. > > Testing: > > - [x] tier1 tests on Unmatched board (release build with `-XX:-UseRVV`) > - [x] hotspot_tier1 and jdk_tier1 on QEMU (release build with `-XX:+UseRVV`) This pull request has now been integrated. Changeset: 49181b81 Author: Feilong Jiang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/49181b81dd284f65455492183ce5d0ab38b48d52 Stats: 29 lines in 1 file changed: 0 ins; 0 del; 29 mod 8303955: RISC-V: Factor out the tmp parameter from copy_memory and copy_memory_v Reviewed-by: fyang, yzhu ------------- PR: https://git.openjdk.org/jdk/pull/12969 From fgao at openjdk.org Tue Mar 14 01:40:11 2023 From: fgao at openjdk.org (Fei Gao) Date: Tue, 14 Mar 2023 01:40:11 GMT Subject: RFR: 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs [v15] In-Reply-To: References: <7BB6Bc-RlCF3GYkbuOEmSJgdv4aAzQ-2cQFdiCY2vaQ=.37b1a20a-70f2-43a2-865e-b49aafe09b42@github.com> <9tb5DWpbU4UzvvQXlD4rRFtNEgwXbcVRNYmDtGan_sI=.21ae9f61-4abc-4143-9bf3-dfcd11b42ce9@github.com> Message-ID: On Mon, 6 Mar 2023 17:46:12 GMT, Jatin Bhateja wrote: >>> do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two. Lets remove them to reduce noise ? >> >> I see a value in having non-power of 2 offsets, yes. They should vectorize if the vector width is small enough. And then there are some values like `18, 20, 192` that are a there to check vectorization with `+AlignVector`, where we expect vectorization only if we have `byte_offset % vector_width == 0`. So it is interesting to have some non-power-of-2 values that have various power-of-2 factors in them. >> >> Maybe you find the `MaxVectorSize <= 12` "noisy" somehow, because it is equivalent to `MaxVectorSize <= 8`? I find it rather helpful, because `12` reflects the `byte_offset`, and so makes the rule a bit more understandable. >> >> Finally, I generate many tests, I don't want to do that by hand. So maybe the rules are not simplified perfectly. I tried to improve it a bit. 
If you have a concrete idea how to further improve, I'm open for suggestions. I could for example round down the values to the next power of 2, or something like that. But again: would that really make the rules more understandable? > >> > With +AlignVector behavior with and without Vectorize,true pragma should match. >> >> This was about example with `fArr[i + 4] = fArr[i];` in the loop. `byte_offset = 4 * 4 = 16`. >> >> @jatin-bhateja I am not sure what you are trying to say, what do you mean by `should match`? >> > > Yes, this was a bug in mainline where we were incorrectly vectorizing which is now fixed with your changes, just wanted to get that point highlighted. > @jatin-bhateja Under `aarch64`, I have made bad experiences with `SuperWordMaxVectorSize`. It is not properly adjusted to be at most `MaxVectorSize`. For example if the `aarch64` machine only supports `MaxVectorSize <= 32`, but I set `SuperWordMaxVectorSize = 64`, then it will keep it at `64`. So then my IR rules fail. For the `x86 / x64` machines we have: > > https://github.com/openjdk/jdk/blob/33bec207103acd520eb99afb093cfafa44aecfda/src/hotspot/cpu/x86/vm_version_x86.cpp#L1314-L1333 > > @fg1417 Would you like to implement this for `aarch64`? @eme64 thanks for pointing it out! I'll take a look at it. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From yyang at openjdk.org Tue Mar 14 02:35:49 2023 From: yyang at openjdk.org (Yi Yang) Date: Tue, 14 Mar 2023 02:35:49 GMT Subject: RFR: 8304034: Remove redundant and meaningless comments in opto [v3] In-Reply-To: References: Message-ID: > Please help review this trivial change to remove redundant and meaningless comments in `hotspot/share/opto` directory. > > They are either > 1. Repeat the function name that the function they comment for. > 2. Makes no sense, e.g. `//----Idealize----` > > And I think original CC-style code (`if( test )`,`call( arg )`) can be formatted in one go, instead of formatting the near code when someone touches them. But this may form a big patch, and it confuses code blame, so I left this work until we reach a consensus. > > Thanks! Yi Yang has updated the pull request incrementally with two additional commits since the last revision: - multiple empty lines to one empty lines - reserve some comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12995/files - new: https://git.openjdk.org/jdk/pull/12995/files/5680a6ce..309503da Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=01-02 Stats: 554 lines in 97 files changed: 10 ins; 536 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/12995.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12995/head:pull/12995 PR: https://git.openjdk.org/jdk/pull/12995 From yyang at openjdk.org Tue Mar 14 03:07:37 2023 From: yyang at openjdk.org (Yi Yang) Date: Tue, 14 Mar 2023 03:07:37 GMT Subject: RFR: 8304034: Remove redundant and meaningless comments in opto [v4] In-Reply-To: References: Message-ID: > Please help review this trivial change to remove redundant and meaningless comments in `hotspot/share/opto` directory. > > They are either > 1. Repeat the function name that the function they comment for. > 2. Makes no sense, e.g. `//----Idealize----` > > And I think original CC-style code (`if( test )`,`call( arg )`) can be formatted in one go, instead of formatting the near code when someone touches them. 
But this may form a big patch, and it confuses code blame, so I left this work until we reach a consensus. > > Thanks! Yi Yang has updated the pull request incrementally with two additional commits since the last revision: - cleanup more - reserve some comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12995/files - new: https://git.openjdk.org/jdk/pull/12995/files/309503da..b3fc99e3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=02-03 Stats: 126 lines in 10 files changed: 5 ins; 121 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12995.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12995/head:pull/12995 PR: https://git.openjdk.org/jdk/pull/12995 From kvn at openjdk.org Tue Mar 14 05:22:57 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 14 Mar 2023 05:22:57 GMT Subject: RFR: 8303415: Add VM_Version::is_intrinsic_supported(id) [v4] In-Reply-To: References: Message-ID: <9jI8j1lHAB1JuFIPlDgXjyp-WzkoiTrx4YJR17-WdIY=.456c5092-7383-4738-940f-f91438cd7e3f@github.com> > Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. > We have *product* VM flags for most intrinsics and set them in VM based on HW support. > But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. > Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. > > I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). > > Additional fixes: > Fixed Interpreter to skip intrinsics if they are disabled with flag. > Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. > Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. > Added missing `native` mark to `_currentThread`. > Removed unused `AbstractInterpreter::in_native_entry()`. > Cleanup C2 intrinsic checks code. > > Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Additional Interpreter changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12858/files - new: https://git.openjdk.org/jdk/pull/12858/files/6be43779..98304c83 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12858&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12858&range=02-03 Stats: 1025 lines in 9 files changed: 288 ins; 384 del; 353 mod Patch: https://git.openjdk.org/jdk/pull/12858.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12858/head:pull/12858 PR: https://git.openjdk.org/jdk/pull/12858 From yyang at openjdk.org Tue Mar 14 05:34:20 2023 From: yyang at openjdk.org (Yi Yang) Date: Tue, 14 Mar 2023 05:34:20 GMT Subject: RFR: 8304034: Remove redundant and meaningless comments in opto [v5] In-Reply-To: References: Message-ID: > Please help review this trivial change to remove redundant and meaningless comments in `hotspot/share/opto` directory. > > They are either > 1. Repeat the function name that the function they comment for. > 2. Makes no sense, e.g. 
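For context on the `_floatToFloat16` intrinsic mentioned above: the Java-level operation it accelerates is `Float.floatToFloat16` (available since JDK 20), shown in this small sketch. Running it with `-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_floatToFloat16` (the intrinsic id used in this thread) should produce the same output from the non-intrinsic path; the class name here is made up.

```java
// What the _floatToFloat16 intrinsic computes at the Java level (JDK 20+ API).
// FloatToFloat16Demo is a made-up class name.
public class FloatToFloat16Demo {
    public static void main(String[] args) {
        float f = 1.5f;
        short h = Float.floatToFloat16(f);    // IEEE 754 binary16 bit pattern
        float back = Float.float16ToFloat(h); // and back to binary32
        System.out.printf("%s -> 0x%04x -> %s%n", f, h, back);
    }
}
```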
`//----Idealize----` > > And I think original CC-style code (`if( test )`,`call( arg )`) can be formatted in one go, instead of formatting the near code when someone touches them. But this may form a big patch, and it confuses code blame, so I left this work until we reach a consensus. > > Thanks! Yi Yang has updated the pull request incrementally with one additional commit since the last revision: restore mistakenly removed lines ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12995/files - new: https://git.openjdk.org/jdk/pull/12995/files/b3fc99e3..92963630 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=03-04 Stats: 37 lines in 1 file changed: 34 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12995.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12995/head:pull/12995 PR: https://git.openjdk.org/jdk/pull/12995 From duke at openjdk.org Tue Mar 14 08:25:25 2023 From: duke at openjdk.org (Damon Fenacci) Date: Tue, 14 Mar 2023 08:25:25 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v2] In-Reply-To: <5hCuWCzGEIWdOmUsak9PyKNPNJgfRW6oZIxDdDhc-WQ=.7793a434-58cf-4fb7-98fa-0ff6c59ea36a@github.com> References: <5hCuWCzGEIWdOmUsak9PyKNPNJgfRW6oZIxDdDhc-WQ=.7793a434-58cf-4fb7-98fa-0ff6c59ea36a@github.com> Message-ID: On Mon, 13 Mar 2023 17:36:27 GMT, Vladimir Kozlov wrote: > Good. What are new numbers of of calls to flush the ICache? The total number of flushes for the _HelloWorld_ on Mac OSX aarch64 go from 3569 to 2756 on C1 (22.8% improvement) and from 3572 to 2685 on C2 (24.1% improvement). ------------- PR: https://git.openjdk.org/jdk/pull/12877 From rrich at openjdk.org Tue Mar 14 10:17:45 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 14 Mar 2023 10:17:45 GMT Subject: RFR: 8299375: [PPC64] GetStackTraceSuspendedStressTest tries to deoptimize frame with invalid fp Message-ID: Mark a frame as not fully initialized when copying it from a continuation StackChunk to the stack until the callers_sp (aka back link) is set. This avoids the assertion given in the bug report when the copied frame is deoptimized before it is fully initialized. IMHO the deoptimization at that point is a little questionable but it actually only changes the pc of the frame which can be done. Note that the frame can get extended later (and metadata can get overridden) but [there is code that handles this](https://github.com/openjdk/jdk/blob/34a92466a615415b76c8cb6010ff7e6e1a1d63b4/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L2108-L2110). Testing: jdk_loom. The fix passed our CI testing. This includes most JCK and JTREG tiers 1-4, also in Xcomp mode, on the standard platforms and also on ppc64le. 
------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/12941/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12941&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299375 Stats: 15 lines in 3 files changed: 14 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12941.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12941/head:pull/12941 PR: https://git.openjdk.org/jdk/pull/12941 From thartmann at openjdk.org Tue Mar 14 10:32:38 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 14 Mar 2023 10:32:38 GMT Subject: RFR: 8304059: Use InstanceKlass in dependencies In-Reply-To: References: Message-ID: On Mon, 13 Mar 2023 20:23:47 GMT, Coleen Phillimore wrote: > Please review this small change to eliminate InstanceKlass::cast() calls for things that should be already InstanceKlass. > Tested with tier1-4. Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/13005 From roland at openjdk.org Tue Mar 14 10:58:11 2023 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 14 Mar 2023 10:58:11 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops [v2] In-Reply-To: References: Message-ID: > In the test case `testByteLong1` (that's extracted from a memory > segment micro benchmark), the address of the store is initially: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) > > > (#numbers are node numbers to help the discussion). > > `iv#101` is the `Phi` of a counted loop. `invar#163` is the > `baseOffset` load. > > To eliminate the range check, the loop is transformed into a loop nest > and as a consequence the address above becomes: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) > > > `invar#308` is some expression from a `Phi` of the outer loop. > > That `AddP` is transformed multiple times to push the invariants out of loop: > > > (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) > > > then: > > > (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) > > > and finally: > > > (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) > > > `AddP#855` is out of the inner loop. > > This doesn't vectorize because: > > - there are 2 invariants in the address expression but superword only > support one (tracked by `_invar` in `SWPointer`) > > - there are more levels of `AddP` (4) than superword supports (3) > > To fix that, I propose to no longer track the address elements in > `_invar`, `_negate_invar` and `_invar_scale` but instead to have a > single `_invar` which is an expression built by superword as it > follows chains of `addP` nodes. I kept the previous `_invar`, > `_negate_invar` and `_invar_scale` as debugging and use them to check > that what vectorized with the previous scheme still does. > > I also propose lifting the restriction on 3 levels of `AddP` entirely. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains three commits: - NULL -> nullptr - Merge branch 'master' into JDK-8300257 - fix & test ------------- Changes: https://git.openjdk.org/jdk/pull/12942/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12942&range=01 Stats: 274 lines in 3 files changed: 212 ins; 23 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/12942.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12942/head:pull/12942 PR: https://git.openjdk.org/jdk/pull/12942 From thartmann at openjdk.org Tue Mar 14 11:12:53 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 14 Mar 2023 11:12:53 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If [v2] In-Reply-To: References: Message-ID: <1Y7V2RlaWMRc77oiYYRUOh6iTsOpkTl7ASUUwI_A5xk=.d32e380e-5dd5-41fe-988c-ec8cb68624c9@github.com> On Mon, 13 Mar 2023 06:34:03 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? It adds new Identity for BoolNode to lookup homogenous integer comparison, i.e. `Bool (CmpX a b)` is identity to `Bool (CmpX b a)`, in this way, we are able to merge the following two "identical" Ifs, which is not before. >> >> >> public static void test(int a, int b) { // ok, identical ifs, apply split_if >> if (a == b) { >> int_field = 0x42; >> } else { >> int_field = 42; >> } >> if (a == b) { >> int_field = 0x42; >> } else { >> int_field = 42; >> } >> } >> >> public static void test(int a, int b) { // do nothing >> if (a == b) { >> int_field = 0x42; >> } else { >> int_field = 42; >> } >> if (b == a) { >> int_field = 0x42; >> } else { >> int_field = 42; >> } >> } >> >> >> Testing: tier1, appllication/ctw/modules > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > dont apply Identity for dead bool src/hotspot/share/opto/subnode.cpp line 1470: > 1468: } > 1469: > 1470: static Node* get_reverse_cmp(int cmp_op, Node* cmp1, Node* cmp2) { This should be a private member method to avoid the arguments. Please also add a comment. src/hotspot/share/opto/subnode.cpp line 1487: > 1485: if (cop == Op_FastLock || cop == Op_FastUnlock || > 1486: cop == Op_SubTypeCheck || cop == Op_VectorTest || > 1487: cmp->is_Overflow()) { Why is this check now needed? src/hotspot/share/opto/subnode.cpp line 1495: > 1493: Node* BoolNode::Identity(PhaseGVN* phase) { > 1494: // "Bool (CmpX a b)" is equivalent to "Bool (CmpX b a)" > 1495: Node *cmp = in(1); Suggestion: Node* cmp = in(1); src/hotspot/share/opto/subnode.cpp line 1502: > 1500: // During parsing, empty uses of bool is tolerable. During iterative GVN, > 1501: // we don't aggressively replace bool whose use is empty with existing node. > 1502: return this; Why should we bother optimizing a dead Bool at all? src/hotspot/share/opto/subnode.cpp line 1512: > 1510: Node* out = reverse_cmp->fast_out(i); > 1511: if (out->is_Bool() && out->as_Bool()->_test._test == _test._test && > 1512: phase->type_or_null(out) != nullptr) { Why is the `phase->type_or_null` required? test/hotspot/jtreg/compiler/c2/irTests/TestBackToBackIfs.java line 1: > 1: /* Please add the bug id to the @bug statement. test/hotspot/jtreg/compiler/c2/irTests/TestBackToBackIfs.java line 63: > 61: @Test > 62: @IR(counts = { IRNode.IF, "1" }) > 63: public static void test1(int a, int b) { Please add a test for `!=`. 
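A sketch of the kind of `!=` companion test being asked for here, following the shape of `test1` above. The class name is made up, the jtreg `@test` header is omitted, and static fields stand in for the `int` parameters:

```java
// Same pattern as test()/test1() above, but with "!=": the two ifs compare the
// same values in swapped order and are expected to collapse into a single If.
import compiler.lib.ir_framework.*;

public class TestBackToBackIfsNe {
    public static int int_field;
    public static int a = 42;
    public static int b = 43;

    public static void main(String[] args) {
        TestFramework.run();
    }

    @Test
    @IR(counts = { IRNode.IF, "1" })
    public static void testNotEqual() {
        if (a != b) {
            int_field = 0x42;
        } else {
            int_field = 42;
        }
        if (b != a) {
            int_field = 0x42;
        } else {
            int_field = 42;
        }
    }
}
```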
------------- PR: https://git.openjdk.org/jdk/pull/12978 From yyang at openjdk.org Tue Mar 14 11:24:13 2023 From: yyang at openjdk.org (Yi Yang) Date: Tue, 14 Mar 2023 11:24:13 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If [v2] In-Reply-To: <1Y7V2RlaWMRc77oiYYRUOh6iTsOpkTl7ASUUwI_A5xk=.d32e380e-5dd5-41fe-988c-ec8cb68624c9@github.com> References: <1Y7V2RlaWMRc77oiYYRUOh6iTsOpkTl7ASUUwI_A5xk=.d32e380e-5dd5-41fe-988c-ec8cb68624c9@github.com> Message-ID: On Tue, 14 Mar 2023 10:42:49 GMT, Tobias Hartmann wrote: >> Yi Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> dont apply Identity for dead bool > > src/hotspot/share/opto/subnode.cpp line 1502: > >> 1500: // During parsing, empty uses of bool is tolerable. During iterative GVN, >> 1501: // we don't aggressively replace bool whose use is empty with existing node. >> 1502: return this; > > Why should we bother optimizing a dead Bool at all? Because it is likely that bool is created but not immediately used during parsing(PhaseGVN), it can be optimized out. ------------- PR: https://git.openjdk.org/jdk/pull/12978 From yyang at openjdk.org Tue Mar 14 11:53:27 2023 From: yyang at openjdk.org (Yi Yang) Date: Tue, 14 Mar 2023 11:53:27 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If [v2] In-Reply-To: <1Y7V2RlaWMRc77oiYYRUOh6iTsOpkTl7ASUUwI_A5xk=.d32e380e-5dd5-41fe-988c-ec8cb68624c9@github.com> References: <1Y7V2RlaWMRc77oiYYRUOh6iTsOpkTl7ASUUwI_A5xk=.d32e380e-5dd5-41fe-988c-ec8cb68624c9@github.com> Message-ID: On Tue, 14 Mar 2023 11:09:29 GMT, Tobias Hartmann wrote: >> Yi Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> dont apply Identity for dead bool > > src/hotspot/share/opto/subnode.cpp line 1512: > >> 1510: Node* out = reverse_cmp->fast_out(i); >> 1511: if (out->is_Bool() && out->as_Bool()->_test._test == _test._test && >> 1512: phase->type_or_null(out) != nullptr) { > > Why is the `phase->type_or_null` required? There is a cyclic case, we should avoid it. Apply PhaseGVN for Bool A -> Generate Bool B in BoolNode::Ideal-> Apply Identity for B -> Find A while type of A is not set. ------------- PR: https://git.openjdk.org/jdk/pull/12978 From yyang at openjdk.org Tue Mar 14 12:01:55 2023 From: yyang at openjdk.org (Yi Yang) Date: Tue, 14 Mar 2023 12:01:55 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If [v2] In-Reply-To: <1Y7V2RlaWMRc77oiYYRUOh6iTsOpkTl7ASUUwI_A5xk=.d32e380e-5dd5-41fe-988c-ec8cb68624c9@github.com> References: <1Y7V2RlaWMRc77oiYYRUOh6iTsOpkTl7ASUUwI_A5xk=.d32e380e-5dd5-41fe-988c-ec8cb68624c9@github.com> Message-ID: On Tue, 14 Mar 2023 10:41:38 GMT, Tobias Hartmann wrote: >> Yi Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> dont apply Identity for dead bool > > src/hotspot/share/opto/subnode.cpp line 1487: > >> 1485: if (cop == Op_FastLock || cop == Op_FastUnlock || >> 1486: cop == Op_SubTypeCheck || cop == Op_VectorTest || >> 1487: cmp->is_Overflow()) { > > Why is this check now needed? For safety considerations through I dont really touch it. I checked the remaining transformation patterns in BoolNode::Ideal, and none of them applied on Overflow input, so I think it's okay to do so.. 
------------- PR: https://git.openjdk.org/jdk/pull/12978 From kvn at openjdk.org Tue Mar 14 12:23:55 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 14 Mar 2023 12:23:55 GMT Subject: Integrated: 8303415: Add VM_Version::is_intrinsic_supported(id) In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 16:16:08 GMT, Vladimir Kozlov wrote: > Currently we check VM flags, directives and JIT compiler support when we generate intrinsics. > We have *product* VM flags for most intrinsics and set them in VM based on HW support. > But not all intrinsics have such flags and it is not scalable to add new *product* flag for each new intrinsic. > Also we have `-XX:DisableIntrinsic=` and `-XX:ControlIntrinsic=` flags to control intrinsics from command line. We don't need specific flags for that. > > I propose to add new `VM_Version::is_intrinsic_supported(id)` method to check platform support for intrinsic without adding new flag. I used it for `_floatToFloat16` intrinsic for my work on [JDK-8302976](https://bugs.openjdk.org/browse/JDK-8302976). > > Additional fixes: > Fixed Interpreter to skip intrinsics if they are disabled with flag. > Moved Interpreter's `InlineIntrinsics` flag check into one place in shared code. > Added separate interpreter id for `_dsqrt_strict` so it could be disabled separately from regular `_dsqrt`. > Added missing `native` mark to `_currentThread`. > Removed unused `AbstractInterpreter::in_native_entry()`. > Cleanup C2 intrinsic checks code. > > Tested tier1-4,xcomp,stress. Also ran tier1-3,xcomp with `-XX:-InlineIntrinsics`. This pull request has now been integrated. Changeset: ec1eb00e Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/ec1eb00ed3290f44bdb175e0ca05522fd860efa1 Stats: 1242 lines in 25 files changed: 401 ins; 379 del; 462 mod 8303415: Add VM_Version::is_intrinsic_supported(id) Reviewed-by: thartmann, dholmes ------------- PR: https://git.openjdk.org/jdk/pull/12858 From yyang at openjdk.org Tue Mar 14 12:43:19 2023 From: yyang at openjdk.org (Yi Yang) Date: Tue, 14 Mar 2023 12:43:19 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If [v3] In-Reply-To: References: Message-ID: > Hi, can I have a review for this patch? It adds new Identity for BoolNode to lookup homogenous integer comparison, i.e. `Bool (CmpX a b)` is identity to `Bool (CmpX b a)`, in this way, we are able to merge the following two "identical" Ifs, which is not before. 
> > > public static void test(int a, int b) { // ok, identical ifs, apply split_if > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > } > > public static void test(int a, int b) { // do nothing > if (a == b) { > int_field = 0x42; > } else { > int_field = 42; > } > if (b == a) { > int_field = 0x42; > } else { > int_field = 42; > } > } > > > Testing: tier1, appllication/ctw/modules Yi Yang has updated the pull request incrementally with one additional commit since the last revision: review from tobias ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12978/files - new: https://git.openjdk.org/jdk/pull/12978/files/f85fac8f..eaa0c440 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12978&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12978&range=01-02 Stats: 58 lines in 3 files changed: 39 ins; 12 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/12978.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12978/head:pull/12978 PR: https://git.openjdk.org/jdk/pull/12978 From duke at openjdk.org Tue Mar 14 13:22:20 2023 From: duke at openjdk.org (Damon Fenacci) Date: Tue, 14 Mar 2023 13:22:20 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: > It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. > There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). > > This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache at `Compilation::emit_code_epilog` and when calling `CodeCache::commit` as flushing is done anyway when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to`. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation. Additionally this halves the number of flushes during adapters generation (lots of calls). > > This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3569 to 2756 on C1 (22.8% improvement) and from 3572 to 2685 on C2 (24.1% improvement). 
Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12877/files - new: https://git.openjdk.org/jdk/pull/12877/files/2e119cb1..0a490b2f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12877&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12877&range=01-02 Stats: 5 lines in 5 files changed: 0 ins; 5 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12877.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12877/head:pull/12877 PR: https://git.openjdk.org/jdk/pull/12877 From duke at openjdk.org Tue Mar 14 13:22:23 2023 From: duke at openjdk.org (Damon Fenacci) Date: Tue, 14 Mar 2023 13:22:23 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v2] In-Reply-To: References: <5hCuWCzGEIWdOmUsak9PyKNPNJgfRW6oZIxDdDhc-WQ=.7793a434-58cf-4fb7-98fa-0ff6c59ea36a@github.com> Message-ID: <-rkdau9lRi7xUZuxQ7OiCBIgpqo3qZ9smxF72Wa4R_A=.79d27b0b-e4ee-4900-bb1e-accc47252c6b@github.com> On Tue, 14 Mar 2023 08:22:42 GMT, Damon Fenacci wrote: >> Good. What are new numbers of of calls to flush the ICache? > >> Good. What are new numbers of of calls to flush the ICache? > > The total number of flushes for the _HelloWorld_ on Mac OSX aarch64 go from 3569 to 2756 on C1 (22.8% improvement) and from 3572 to 2685 on C2 (24.1% improvement). > @dafedafe I look on stack traces you collected. Please look on this: > > ``` > 732 Stack: > ICache::invalidate_range(unsigned char *, int) icache_bsd_aarch64.hpp:41 > AbstractAssembler::flush() assembler.cpp:110 > SharedRuntime::generate_i2c2i_adapters(MacroAssembler *, int, int, const BasicType *, const VMRegPair *, AdapterFingerPrint *) sharedRuntime_aarch64.cpp:794 > AdapterHandlerLibrary::create_adapter(AdapterBlob *&, int, BasicType *, bool) sharedRuntime.cpp:2970 > ``` > generate_i2c2i_adapters is called for temporary buffer so we don't need to flush it: Yes, thanks a lot @vnkozlov! I've removed the flushing there too. Now the number of flushes for _HelloWorld_ is down to 2028 with C1 (43% improvement). ------------- PR: https://git.openjdk.org/jdk/pull/12877 From kvn at openjdk.org Tue Mar 14 13:28:58 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 14 Mar 2023 13:28:58 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 13:22:20 GMT, Damon Fenacci wrote: >> It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. >> There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). >> >> This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache at `Compilation::emit_code_epilog` and when calling `CodeCache::commit` as flushing is done anyway when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to`. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation. Additionally this halves the number of flushes during adapters generation (lots of calls). 
>> >> This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3569 to 2756 on C1 (22.8% improvement) and from 3572 to 2685 on C2 (24.1% improvement). > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters Nice! Good work. Now please thoroughly test it. ------------- PR: https://git.openjdk.org/jdk/pull/12877 From coleenp at openjdk.org Tue Mar 14 13:32:51 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 14 Mar 2023 13:32:51 GMT Subject: RFR: 8304059: Use InstanceKlass in dependencies In-Reply-To: References: Message-ID: On Mon, 13 Mar 2023 20:23:47 GMT, Coleen Phillimore wrote: > Please review this small change to eliminate InstanceKlass::cast() calls for things that should be already InstanceKlass. > Tested with tier1-4. Thank you Tobias. ------------- PR: https://git.openjdk.org/jdk/pull/13005 From coleenp at openjdk.org Tue Mar 14 13:32:52 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 14 Mar 2023 13:32:52 GMT Subject: Integrated: 8304059: Use InstanceKlass in dependencies In-Reply-To: References: Message-ID: On Mon, 13 Mar 2023 20:23:47 GMT, Coleen Phillimore wrote: > Please review this small change to eliminate InstanceKlass::cast() calls for things that should be already InstanceKlass. > Tested with tier1-4. This pull request has now been integrated. Changeset: 55aa1224 Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/55aa122462c34d8f4cafa58f4d1f2d900449c83e Stats: 27 lines in 4 files changed: 0 ins; 1 del; 26 mod 8304059: Use InstanceKlass in dependencies Reviewed-by: vlivanov, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13005 From thartmann at openjdk.org Tue Mar 14 13:35:06 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 14 Mar 2023 13:35:06 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 13:22:20 GMT, Damon Fenacci wrote: >> It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. >> There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). >> >> This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache at `Compilation::emit_code_epilog` and when calling `CodeCache::commit` as flushing is done anyway when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to`. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation. Additionally this halves the number of flushes during adapters generation (lots of calls). >> >> This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3569 to 2756 on C1 (22.8% improvement) and from 3572 to 2685 on C2 (24.1% improvement). > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters Good catch, Vladimir. 
New version looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12877 From duke at openjdk.org Tue Mar 14 14:50:18 2023 From: duke at openjdk.org (Damon Fenacci) Date: Tue, 14 Mar 2023 14:50:18 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 13:22:20 GMT, Damon Fenacci wrote: >> It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. >> There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). >> >> This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache: >> * at `Compilation::emit_code_epilog` >> * when calling `CodeCache::commit` as flushing is done anyway when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to`. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation. Additionally this halves the number of flushes during adapters generation (lots of calls). >> * at `SharedRuntime::generate_i2c2i_adapters` as this is called with a temporary buffer and an ICache flush is not needed >> >> This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3569 to 2028 on C1 (43.2% improvement) and from 3572 to 1952 on C2 (45.4% improvement). >> >> This fix includes changes for x86_32/64 and aarch64, which I could test thoroughly but also for **arm** and **riscv**, for which I would need some help with testing. > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters This fix includes some changes to `sharedRuntime` for multiple platforms. I could thoroughly test a x86_32/64 and aarch64 but I'd need some help testing on **ARM** and **RISCV**. @RealFYang could you or someone you know please run some testing for **RISCV** to make sure my change doesn't break anything? @bulasevich @shqking could I ask you the same for **ARM**? Thank you very much in advance for your help! ------------- PR: https://git.openjdk.org/jdk/pull/12877 From chagedorn at openjdk.org Tue Mar 14 15:01:53 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 14 Mar 2023 15:01:53 GMT Subject: RFR: 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers [v5] In-Reply-To: <1oEtfGK7rvziWpDy2fuRJvQS5mlY6wrybfzpPdo5cDs=.490c2c57-1aea-4a8a-9f51-258afd7eba1d@github.com> References: <1oEtfGK7rvziWpDy2fuRJvQS5mlY6wrybfzpPdo5cDs=.490c2c57-1aea-4a8a-9f51-258afd7eba1d@github.com> Message-ID: On Mon, 13 Mar 2023 15:43:05 GMT, Christian Hagedorn wrote: >> The current logic in `MulLNode::mul_ring()` casts all `jlong` values of the involved type ranges of a multiplication to `double` in order to catch overflows when multiplying the two type ranges. This works fine for values in the `jlong` range that are not larger than 253 or lower than -253. 
For numbers outside that range, we could experience precision errors because these numbers cannot be represented precisely due to the nature of how doubles are represented with a 52 bit mantissa. For example, the number 253 and 253 + 1 both have the same `double` representation of 253. >> >> In `MulLNode::mul_ring()`, we could do a multiplication with a `lo` or `hi` value of a type that is larger than 253 (or smaller than -253). In this case, we might get a different result compared to doing the same multiplication with `jlong` values (even though there is no overflow/underflow). As a result, we return `TypeLong::LONG` (bottom type) and missed an optimization opportunity (e.g. folding an `If` when using the `MulL` node in the condition etc.). >> >> This was caught by the new verification code added to CCP in [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197) which checks that after CCP, we should not get a different type anymore when calling `Value()` on a node. In the found fuzzer testcase, we run into the precision problem described above for a `MulL` node and set the type to bottom during CCP (even though there was no actual overflow). Since the type is bottom, we do not re-add the node to the CCP worklist because the premise is that types only go from top to bottom during CCP. Afterwards, an input type of the `MulL` node is updated again in such a way that the previously imprecise `double` multiplication in `mul_ring()` is now exact (by coincidence). We then hit the "missed optimization opportunity" assert added by JDK-8257197. >> >> To fix this problem, I suggest to switch from a `jlong` - > `double` multiplication overflow check to an overflow check without casting. I've used the idea that `x = a * b` is the same as `b = x / a` (for `a != 0` and `!(a = -1 && b = MIN_VALUE)`) which is also applied in `Math.multiplyExact()`: https://github.com/openjdk/jdk/blob/66db0bb6a15310e4e60ff1e33d40e03c52c4eca8/src/java.base/share/classes/java/lang/Math.java#L1022-L1036 >> >> The code of `MulLNode::mul_ring()` is almost identical to `MulINode::mul_ring()`. I've refactored that into a template class in order to share the code and simplified the overflow checking by using `MIN/MAX4` instead of using nested `if/else` statements. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix build failures on Mac Thanks Vladimir for approving it again. I've re-run some testing with latest master which looked good. ------------- PR: https://git.openjdk.org/jdk/pull/11907 From chagedorn at openjdk.org Tue Mar 14 15:01:57 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 14 Mar 2023 15:01:57 GMT Subject: Integrated: 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers In-Reply-To: References: Message-ID: On Mon, 9 Jan 2023 16:19:46 GMT, Christian Hagedorn wrote: > The current logic in `MulLNode::mul_ring()` casts all `jlong` values of the involved type ranges of a multiplication to `double` in order to catch overflows when multiplying the two type ranges. This works fine for values in the `jlong` range that are not larger than 253 or lower than -253. For numbers outside that range, we could experience precision errors because these numbers cannot be represented precisely due to the nature of how doubles are represented with a 52 bit mantissa. For example, the number 253 and 253 + 1 both have the same `double` representation of 253. 
> > In `MulLNode::mul_ring()`, we could do a multiplication with a `lo` or `hi` value of a type that is larger than 253 (or smaller than -253). In this case, we might get a different result compared to doing the same multiplication with `jlong` values (even though there is no overflow/underflow). As a result, we return `TypeLong::LONG` (bottom type) and missed an optimization opportunity (e.g. folding an `If` when using the `MulL` node in the condition etc.). > > This was caught by the new verification code added to CCP in [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197) which checks that after CCP, we should not get a different type anymore when calling `Value()` on a node. In the found fuzzer testcase, we run into the precision problem described above for a `MulL` node and set the type to bottom during CCP (even though there was no actual overflow). Since the type is bottom, we do not re-add the node to the CCP worklist because the premise is that types only go from top to bottom during CCP. Afterwards, an input type of the `MulL` node is updated again in such a way that the previously imprecise `double` multiplication in `mul_ring()` is now exact (by coincidence). We then hit the "missed optimization opportunity" assert added by JDK-8257197. > > To fix this problem, I suggest to switch from a `jlong` - > `double` multiplication overflow check to an overflow check without casting. I've used the idea that `x = a * b` is the same as `b = x / a` (for `a != 0` and `!(a = -1 && b = MIN_VALUE)`) which is also applied in `Math.multiplyExact()`: https://github.com/openjdk/jdk/blob/66db0bb6a15310e4e60ff1e33d40e03c52c4eca8/src/java.base/share/classes/java/lang/Math.java#L1022-L1036 > > The code of `MulLNode::mul_ring()` is almost identical to `MulINode::mul_ring()`. I've refactored that into a template class in order to share the code and simplified the overflow checking by using `MIN/MAX4` instead of using nested `if/else` statements. > > Thanks, > Christian This pull request has now been integrated. Changeset: c466cdf9 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/c466cdf973ca9c4ecec1a28f158ebf366386024e Stats: 1079 lines in 4 files changed: 1006 ins; 38 del; 35 mod 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers Reviewed-by: iveresov, kvn, qamai ------------- PR: https://git.openjdk.org/jdk/pull/11907 From yzheng at openjdk.org Tue Mar 14 15:09:25 2023 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 14 Mar 2023 15:09:25 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. Message-ID: Upon uncommon_trap, JVMCI runtime appends a FailedSpeculation entry to the nmethod using an [atomic operation](https://github.com/openjdk/jdk/blob/55aa122462c34d8f4cafa58f4d1f2d900449c83e/src/hotspot/share/oops/methodData.cpp#L852). It becomes a performance bottleneck when there is a large amount of (virtual) threads deoptimizing in the nmethod. In this PR, we test if a FailedSpeculation exists in the list before appending it. ------------- Commit messages: - [JVMCI] Test FailedSpeculation existence before appending. 
Changes: https://git.openjdk.org/jdk/pull/13022/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13022&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304138 Stats: 10 lines in 1 file changed: 9 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13022.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13022/head:pull/13022 PR: https://git.openjdk.org/jdk/pull/13022 From yzheng at openjdk.org Tue Mar 14 15:34:50 2023 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 14 Mar 2023 15:34:50 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v2] In-Reply-To: References: Message-ID: > Upon uncommon_trap, JVMCI runtime appends a FailedSpeculation entry to the nmethod using an [atomic operation](https://github.com/openjdk/jdk/blob/55aa122462c34d8f4cafa58f4d1f2d900449c83e/src/hotspot/share/oops/methodData.cpp#L852). It becomes a performance bottleneck when there is a large amount of (virtual) threads deoptimizing in the nmethod. In this PR, we test if a FailedSpeculation exists in the list before appending it. Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: avoid iterating from beginning. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13022/files - new: https://git.openjdk.org/jdk/pull/13022/files/2f9874df..e8c7eec4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13022&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13022&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13022.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13022/head:pull/13022 PR: https://git.openjdk.org/jdk/pull/13022 From kvn at openjdk.org Tue Mar 14 15:34:51 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 14 Mar 2023 15:34:51 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v2] In-Reply-To: References: Message-ID: <5RHZBxl7toRIqTjQJ8hxd3XWiWFGM9lqeUsKD5Cgw58=.6dda1377-3a3e-44a7-ae13-1a8469a863c4@github.com> On Tue, 14 Mar 2023 15:30:37 GMT, Yudi Zheng wrote: >> Upon uncommon_trap, JVMCI runtime appends a FailedSpeculation entry to the nmethod using an [atomic operation](https://github.com/openjdk/jdk/blob/55aa122462c34d8f4cafa58f4d1f2d900449c83e/src/hotspot/share/oops/methodData.cpp#L852). It becomes a performance bottleneck when there is a large amount of (virtual) threads deoptimizing in the nmethod. In this PR, we test if a FailedSpeculation exists in the list before appending it. > > Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: > > avoid iterating from beginning. I agree that it is better than before. But without some kind of lock several threads can add the same speculation data in the window between new check and adding data to the list. src/hotspot/share/oops/methodData.cpp line 858: > 856: guarantee_failed_speculations_alive(nm, failed_speculations_address); > 857: > 858: cursor = failed_speculations_address; Why not continue from `cursor` value from previous check loop? Can the list be modified by other threads in between? ------------- PR: https://git.openjdk.org/jdk/pull/13022 From yzheng at openjdk.org Tue Mar 14 15:34:52 2023 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 14 Mar 2023 15:34:52 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. 
[v2] In-Reply-To: <5RHZBxl7toRIqTjQJ8hxd3XWiWFGM9lqeUsKD5Cgw58=.6dda1377-3a3e-44a7-ae13-1a8469a863c4@github.com> References: <5RHZBxl7toRIqTjQJ8hxd3XWiWFGM9lqeUsKD5Cgw58=.6dda1377-3a3e-44a7-ae13-1a8469a863c4@github.com> Message-ID: On Tue, 14 Mar 2023 15:28:31 GMT, Vladimir Kozlov wrote: >> Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: >> >> avoid iterating from beginning. > > src/hotspot/share/oops/methodData.cpp line 858: > >> 856: guarantee_failed_speculations_alive(nm, failed_speculations_address); >> 857: >> 858: cursor = failed_speculations_address; > > Why not continue from `cursor` value from previous check loop? > Can the list be modified by other threads in between? It cannot be modified in between. Addressed in https://github.com/openjdk/jdk/pull/13022/commits/e8c7eec468b049e15bf54bac280585389935a37d by continuing from `cursor` ------------- PR: https://git.openjdk.org/jdk/pull/13022 From yzheng at openjdk.org Tue Mar 14 15:45:17 2023 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 14 Mar 2023 15:45:17 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v2] In-Reply-To: <5RHZBxl7toRIqTjQJ8hxd3XWiWFGM9lqeUsKD5Cgw58=.6dda1377-3a3e-44a7-ae13-1a8469a863c4@github.com> References: <5RHZBxl7toRIqTjQJ8hxd3XWiWFGM9lqeUsKD5Cgw58=.6dda1377-3a3e-44a7-ae13-1a8469a863c4@github.com> Message-ID: On Tue, 14 Mar 2023 15:25:42 GMT, Vladimir Kozlov wrote: > I agree that it is better than before. But without some kind of lock several threads can add the same speculation data in the window between new check and adding data to the list. Right. We can bear a few redundant FailedSpeculation entries, and would like to avoid locking in this fast path. ------------- PR: https://git.openjdk.org/jdk/pull/13022 From kostasto at proton.me Tue Mar 14 15:59:15 2023 From: kostasto at proton.me (Kosta Stojiljkovic) Date: Tue, 14 Mar 2023 15:59:15 +0000 Subject: test/lib-test/jdk/test/whitebox/CPUInfoTest.java fails on Intel Alder/Raptor Lake Message-ID: Dear all, On a machine with the 13th gen Intel CPU, WhiteBox test in the file CPUInfoTest.java fails. The test in question checks the features returned from the CPUInfo class against a hardcoded set of well known CPU features inside the test, that looks like this: wellKnownCPUFeatures = Set.of( "cx8", "cmov", "fxsr", "ht", "mmx", "3dnowpref", "sse", "sse2", "sse3", "ssse3", "sse4a", "sse4.1", "sse4.2", "popcnt", "lzcnt", "tsc", "tscinvbit", "tscinv", "avx", "avx2", "aes", "erms", "clmul", "bmi1", "bmi2", "rtm", "adx", "avx512f", "avx512dq", "avx512pf", "avx512er", "avx512cd", "avx512bw", "avx512vl", "sha", "fma", "vzeroupper", "avx512_vpopcntdq", "avx512_vpclmulqdq", "avx512_vaes", "avx512_vnni", "clflush", "clflushopt", "clwb", "avx512_vbmi2", "avx512_vbmi", "rdtscp", "rdpid", "hv", "fsrm", "avx512_bitalg", "gfni", "f16c", "pku", "ospke", "cet_ibt", "cet_ss", "avx512_ifma"); This set of strings on the other hand does not account for the SERIALIZE instruction, added in the 12th generation of Intel Core processors (codenamed Alder Lake), while the processor inspection implementation in /src/hotspot/cpu/x86/vm_version_x86.cpp picks up the flag for it, thus leading to a discrepancy between the features set in the test and the features string obtained from CPUInfo class, when ran on the 12th gen processors and higher. 
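For illustration, a small hypothetical sketch of the kind of containment check involved (the names below are invented and are not the actual jtreg test code): a feature reported by the VM but missing from the hard-coded set makes the test fail, which is what happens once the VM starts reporting "serialize".

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: every feature string reported by the VM must be in the well-known set.
public class CpuFeatureCheckSketch {
    static void check(Set<String> wellKnown, List<String> reportedByVM) {
        for (String feature : reportedByVM) {
            if (!wellKnown.contains(feature)) {
                throw new AssertionError("Unknown CPU feature reported by the VM: " + feature);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> wellKnown = Set.of("sse", "sse2", "avx", "avx2"); // abbreviated on purpose
        check(wellKnown, List.of("sse", "avx", "serialize")); // throws: "serialize" is not in the set
    }
}
```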
The support for this feature seems to have been added to the code base with the following commit: [8284161: Implementation of Virtual Thread](https://github.com/openjdk/jdk/commit/9583e3657e43cc1c6f2101a64534564db2a9bd84), but the authors may have missed adding the "serialize" string to the set of well known CPU features in the CPUInfoTest.java file. I would like to extend the wellKnownCPUFeatures set with the "serialize" keyword, unless there is a reason that this keyword is missing that I do not see? If that isn't the case, I would appreciate getting some support with creating an issue in JBS, since I am not an author yet :) I look forward to your feedback! Best Regards, Kosta Stojiljkovic -------------- next part -------------- An HTML attachment was scrubbed... URL: From dnsimon at openjdk.org Tue Mar 14 16:00:31 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 14 Mar 2023 16:00:31 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v2] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 15:34:50 GMT, Yudi Zheng wrote: >> Upon uncommon_trap, JVMCI runtime appends a FailedSpeculation entry to the nmethod using an [atomic operation](https://github.com/openjdk/jdk/blob/55aa122462c34d8f4cafa58f4d1f2d900449c83e/src/hotspot/share/oops/methodData.cpp#L852). It becomes a performance bottleneck when there is a large amount of (virtual) threads deoptimizing in the nmethod. In this PR, we test if a FailedSpeculation exists in the list before appending it. > > Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: > > avoid iterating from beginning. Marked as reviewed by dnsimon (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/13022 From kvn at openjdk.org Tue Mar 14 16:09:03 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 14 Mar 2023 16:09:03 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v2] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 15:34:50 GMT, Yudi Zheng wrote: >> Upon uncommon_trap, JVMCI runtime appends a FailedSpeculation entry to the nmethod using an [atomic operation](https://github.com/openjdk/jdk/blob/55aa122462c34d8f4cafa58f4d1f2d900449c83e/src/hotspot/share/oops/methodData.cpp#L852). It becomes a performance bottleneck when there is a large amount of (virtual) threads deoptimizing in the nmethod. In this PR, we test if a FailedSpeculation exists in the list before appending it. > > Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: > > avoid iterating from beginning. Good. Please, re-test with latest changes. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/13022 From matsaave at openjdk.org Tue Mar 14 19:24:21 2023 From: matsaave at openjdk.org (Matias Saavedra Silva) Date: Tue, 14 Mar 2023 19:24:21 GMT Subject: RFR: 8241613: Suspicious calls to MacroAssembler::null_check(Register, offset) Message-ID: In several places in HotSpot, the method MacroAssembler::null_check(Register, offset) is called in a way that never produces any null check in the assembly code. The method null_check(Register, offset) calls needs_explicit_null_check(offset) to determine if it must emit a null check in the assembly code or not. needs_explicit_null_check(offset) returns true only if the offset is negative or bigger than the os page size. the offset being passed is the offset of a field in the header of Java object or a Java array. 
In both cases, the offset is always positive and smaller than an os page size. A null_check() call with a single parameter will always produce a null check in assembly. The cases suggested in the issue have been addressed by either removing or preserving the null_check. ------------- Commit messages: - 8241613: Suspicious calls to MacroAssembler::null_check(Register, offset) Changes: https://git.openjdk.org/jdk/pull/13026/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13026&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8241613 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13026.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13026/head:pull/13026 PR: https://git.openjdk.org/jdk/pull/13026 From azvegint at openjdk.org Tue Mar 14 20:06:03 2023 From: azvegint at openjdk.org (Alexander Zvegintsev) Date: Tue, 14 Mar 2023 20:06:03 GMT Subject: RFR: 8304172: ProblemList serviceability/sa/UniqueVtableTest.java In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 19:52:01 GMT, Daniel D. Daugherty wrote: > Trivial fixes to ProblemList a couple of tests: > > [JDK-8304172](https://bugs.openjdk.org/browse/JDK-8304172) ProblemList serviceability/sa/UniqueVtableTest.java > [JDK-8304175](https://bugs.openjdk.org/browse/JDK-8304175) ProblemList compiler/vectorapi/VectorLogicalOpIdentityTest.java on 2 platforms Marked as reviewed by azvegint (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/13029 From dcubed at openjdk.org Tue Mar 14 20:06:01 2023 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 14 Mar 2023 20:06:01 GMT Subject: RFR: 8304172: ProblemList serviceability/sa/UniqueVtableTest.java Message-ID: Trivial fixes to ProblemList a couple of tests: [JDK-8304172](https://bugs.openjdk.org/browse/JDK-8304172) ProblemList serviceability/sa/UniqueVtableTest.java [JDK-8304175](https://bugs.openjdk.org/browse/JDK-8304175) ProblemList compiler/vectorapi/VectorLogicalOpIdentityTest.java on 2 platforms ------------- Commit messages: - 8304175: ProblemList compiler/vectorapi/VectorLogicalOpIdentityTest.java on 2 platforms - 8304172: ProblemList serviceability/sa/UniqueVtableTest.java Changes: https://git.openjdk.org/jdk/pull/13029/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13029&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304172 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13029.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13029/head:pull/13029 PR: https://git.openjdk.org/jdk/pull/13029 From dcubed at openjdk.org Tue Mar 14 20:09:13 2023 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 14 Mar 2023 20:09:13 GMT Subject: RFR: 8304172: ProblemList serviceability/sa/UniqueVtableTest.java In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 20:00:30 GMT, Alexander Zvegintsev wrote: >> Trivial fixes to ProblemList a couple of tests: >> >> [JDK-8304172](https://bugs.openjdk.org/browse/JDK-8304172) ProblemList serviceability/sa/UniqueVtableTest.java >> [JDK-8304175](https://bugs.openjdk.org/browse/JDK-8304175) ProblemList compiler/vectorapi/VectorLogicalOpIdentityTest.java on 2 platforms > > Marked as reviewed by azvegint (Reviewer). @azvegint - Thanks for the fast review! ------------- PR: https://git.openjdk.org/jdk/pull/13029 From dcubed at openjdk.org Tue Mar 14 20:13:18 2023 From: dcubed at openjdk.org (Daniel D. 
Daugherty) Date: Tue, 14 Mar 2023 20:13:18 GMT Subject: Integrated: 8304172: ProblemList serviceability/sa/UniqueVtableTest.java In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 19:52:01 GMT, Daniel D. Daugherty wrote: > Trivial fixes to ProblemList a couple of tests: > > [JDK-8304172](https://bugs.openjdk.org/browse/JDK-8304172) ProblemList serviceability/sa/UniqueVtableTest.java > [JDK-8304175](https://bugs.openjdk.org/browse/JDK-8304175) ProblemList compiler/vectorapi/VectorLogicalOpIdentityTest.java on 2 platforms This pull request has now been integrated. Changeset: 617c15f5 Author: Daniel D. Daugherty URL: https://git.openjdk.org/jdk/commit/617c15f5a131fdf254fc4277f6dd78d64292db1c Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod 8304172: ProblemList serviceability/sa/UniqueVtableTest.java 8304175: ProblemList compiler/vectorapi/VectorLogicalOpIdentityTest.java on 2 platforms Reviewed-by: azvegint ------------- PR: https://git.openjdk.org/jdk/pull/13029 From kvn at openjdk.org Tue Mar 14 23:17:49 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 14 Mar 2023 23:17:49 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 14:47:34 GMT, Damon Fenacci wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters > > This fix includes some changes to `sharedRuntime` for multiple platforms. I could thoroughly test on x86_32/64 and aarch64 but I'd need some help testing on **ARM** and **RISCV**. > @RealFYang could you or someone you know please run some testing for **RISCV** to make sure my change doesn't break anything? @bulasevich @shqking could I ask you the same for **ARM**? > Thank you very much in advance for your help! @dafedafe as followup RFE (I don't want to add more changes to this PR) look on all uses of `masm::flush()` for **temporary** buffers. I see such uses in some other `SharedRuntime` methods: [SharedRuntime::generate_resolve_blob](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#L3299) There are may be only few but since we start fixing such cases we should finish. ------------- PR: https://git.openjdk.org/jdk/pull/12877 From duke at openjdk.org Wed Mar 15 03:38:54 2023 From: duke at openjdk.org (changpeng1997) Date: Wed, 15 Mar 2023 03:38:54 GMT Subject: RFR: 8302906: AArch64: Add SVE backend support for vector unsigned comparison [v4] In-Reply-To: References: Message-ID: > This patch implements unsigned vector comparison on SVE. > > 1: Test: > All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. > > 2: Performance: > (1): Benchmark: > As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. 
for ByteVector unsigned comparison: > > > @Benchmark > public void byteVectorUnsignedCompare() { > for (int j = 0; j < 200; j++) { > for (int i = 0; i < bspecies.length(); i++) { > ByteVector av = ByteVector.fromArray(bspecies, ba, i); > ByteVector ca = ByteVector.fromArray(bspecies, bb, i); > av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); > } > } > } > > > (2): Performance data > > Before: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 > ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 > IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 > LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 > > > After: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 > ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 > IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 > LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 > > > [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector > [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi > [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 > [4] https://bugs.openjdk.org/browse/JDK-8282850 > [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae changpeng1997 has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'openjdk:master' into sve_cmpU - Refactor part of code in C2 assembler and remove some switch-case stmts. - Merge branch 'openjdk:master' into sve_cmpU - 8302906: AArch64: Add SVE backend support for vector unsigned comparison This patch implements unsigned vector comparison on SVE. 1: Test: All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. 2: Performance: (1): Benchmark: As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. 
for ByteVector unsigned comparison: ``` @Benchmark public void byteVectorUnsignedCompare() { for (int j = 0; j < 200; j++) { for (int i = 0; i < bspecies.length(); i++) { ByteVector av = ByteVector.fromArray(bspecies, ba, i); ByteVector ca = ByteVector.fromArray(bspecies, bb, i); av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); } } } ``` (2): Performance data Before: ``` Benchmark Score(op/ms) Error ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 ``` After: ``` Benchmark Score(op/ms) Error ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 ``` [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 [4] https://bugs.openjdk.org/browse/JDK-8282850 [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae TEST_CMD: true Jira: ENTLLT-6097 Change-Id: I236cf4a7626af3aad04bf081b47849a00e77df15 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12725/files - new: https://git.openjdk.org/jdk/pull/12725/files/5acf5ba4..12fa26bb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12725&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12725&range=02-03 Stats: 117949 lines in 1227 files changed: 86726 ins; 16946 del; 14277 mod Patch: https://git.openjdk.org/jdk/pull/12725.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12725/head:pull/12725 PR: https://git.openjdk.org/jdk/pull/12725 From duke at openjdk.org Wed Mar 15 03:48:01 2023 From: duke at openjdk.org (changpeng1997) Date: Wed, 15 Mar 2023 03:48:01 GMT Subject: RFR: 8302906: AArch64: Add SVE backend support for vector unsigned comparison [v5] In-Reply-To: References: Message-ID: <0N4QegqcD5I5funLRjC7WtivwQkmWgAlbI38Wcy2k8I=.c8fb8cff-4dcd-4f8c-a582-5ae1903773c8@github.com> > This patch implements unsigned vector comparison on SVE. > > 1: Test: > All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. > > 2: Performance: > (1): Benchmark: > As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. 
for ByteVector unsigned comparison: > > > @Benchmark > public void byteVectorUnsignedCompare() { > for (int j = 0; j < 200; j++) { > for (int i = 0; i < bspecies.length(); i++) { > ByteVector av = ByteVector.fromArray(bspecies, ba, i); > ByteVector ca = ByteVector.fromArray(bspecies, bb, i); > av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); > } > } > } > > > (2): Performance data > > Before: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 > ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 > IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 > LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 > > > After: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 > ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 > IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 > LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 > > > [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector > [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi > [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 > [4] https://bugs.openjdk.org/browse/JDK-8282850 > [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae changpeng1997 has updated the pull request incrementally with one additional commit since the last revision: Move cm() and fcm() to advsimd-three-same section. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12725/files - new: https://git.openjdk.org/jdk/pull/12725/files/12fa26bb..a0afaf80 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12725&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12725&range=03-04 Stats: 80 lines in 1 file changed: 40 ins; 40 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12725.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12725/head:pull/12725 PR: https://git.openjdk.org/jdk/pull/12725 From duke at openjdk.org Wed Mar 15 03:50:20 2023 From: duke at openjdk.org (changpeng1997) Date: Wed, 15 Mar 2023 03:50:20 GMT Subject: RFR: 8302906: AArch64: Add SVE backend support for vector unsigned comparison [v3] In-Reply-To: References: Message-ID: On Mon, 13 Mar 2023 10:29:27 GMT, Andrew Haley wrote: >> changpeng1997 has updated the pull request incrementally with one additional commit since the last revision: >> >> Refactor part of code in C2 assembler and remove some switch-case stmts. > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3218: > >> 3216: f(1, 21), rf(Vm, 16), f(0b111001, 15, 10), rf(Vn, 5), rf(Vd, 0); >> 3217: } >> 3218: > > This looks OK, but it's in the wrong place in the file. Look at C4.1 A64 instruction set encoding. These instructions are in the "Advanced SIMD three same" group, so they must appear in assembler_aarch64.hpp in the "Advanced SIMD three same" section. > This is the "AdvSIMD two-reg misc" section. Sorry for this error. 
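As background for this unsigned-comparison thread, a scalar Java illustration of what an unsigned lane comparison such as UNSIGNED_LT means (illustration only; it is not how the SVE backend emits code, since SVE has native unsigned compare instructions):

```java
// Illustration only: flipping the sign bit turns an unsigned comparison into a signed one,
// the usual fallback when hardware only offers signed compares.
public class UnsignedCompareSketch {
    static boolean unsignedLessThan(int a, int b) {
        return (a ^ Integer.MIN_VALUE) < (b ^ Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(unsignedLessThan(1, -1));            // true: -1 is 0xFFFFFFFF unsigned
        System.out.println(Integer.compareUnsigned(1, -1) < 0); // true, same result
    }
}
```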
------------- PR: https://git.openjdk.org/jdk/pull/12725 From kvn at openjdk.org Wed Mar 15 04:54:33 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 15 Mar 2023 04:54:33 GMT Subject: RFR: 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs [v26] In-Reply-To: <6-Ea0WdYmM96b6eqsNReqIKevZPZsZkWEOSoUyycJpo=.11b237f7-35ca-4519-a605-5a541719c5cb@github.com> References: <_ygbSM3ARi21Y5jfOqEIYEqZe84cwTbQ6vayG9hVYiY=.36ace32c-73ac-452e-a9ac-f9ffaea08931@github.com> <6-Ea0WdYmM96b6eqsNReqIKevZPZsZkWEOSoUyycJpo=.11b237f7-35ca-4519-a605-5a541719c5cb@github.com> Message-ID: On Mon, 13 Mar 2023 10:13:40 GMT, Emanuel Peter wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 40 commits: >> >> - Merge master after NULL -> nullptr conversion >> - Fixed wording from last commit >> - A little renaming and improved comments >> - resolve merge conflict after Roland's fix >> - TestDependencyOffsets.java: add vanilla run >> - TestDependencyOffsets.java: parallelize it + various AVX settings >> - TestOptionVectorizeIR.java: removed PopulateIndex IR rule - fails on x86 32bit - see Matcher::match_rule_supported >> - Merge branch 'master' into JDK-8298935 >> - Reworked TestDependencyOffsets.java >> - remove negative IR rules for TestOptionVectorizeIR.java >> - ... and 30 more: https://git.openjdk.org/jdk/compare/5726d31e...731cc7b5 > > https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 > >> So is there a **4th Bug** lurking here? > > @vnkozlov I now found an example that reveals this **Bug 4**. I want to fix it in a separate Bug [JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042). > > This is the method: > > static void test(int[] dataI1, int[] dataI2, float[] dataF1, float[] dataF2) { > for (int i = 0; i < RANGE/2; i+=2) { > dataF1[i+0] = dataI1[i+0] + 0.33f; // 1 > dataI2[i+1] = (int)(11.0f * dataF2[i+1]); // 2 > > dataI2[i+0] = (int)(11.0f * dataF2[i+0]); // 3 > dataF1[i+1] = dataI1[i+1] + 0.33f; // 4 > } > } > > > Note: `dataI1 == dataI2` and `dataF1 == dataF2`. I only had to use two references so that C2 does not know this, and does not optimize away load after store. > > Lines 1 and 4 are `isomorphic` and `independent`. The same holds for line 2 and 3. We creates the packs `[1,4]` and `[2,3]`, and vectorize (with and without my patch). However, we have the following dependencies: `1->3` and `2->4`. This creates a cyclic dependency between the two packs. > > As explained in the previous https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129, we have to verify that there are no cyclic dependencies between the packs, just before we schedule. The SuperWord paper states this in "3.7 Scheduling". @eme64, I ran with `-XX:-TieredCompilation -Xbatch -XX:CICompilerCount=1 -XX:+TraceNewVectors` on AVX512 linux machine our vectors jtregs tests (including `jdk/incubator/vector) and everything is fine except one test: compiler/loopopts/superword/TestPickLastMemoryState.java` With these changes we almost don't generate vectors (may be 2 per `@run`). Without changes we got about 160 (50 per `@run`) new vectors. It has several `@run` commands for different vector sizes. 
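To make the cyclic pack dependency in the example above concrete, here is a small scalar Java illustration (hypothetical driver, not the jtreg test): once the int and float arrays alias, executing pack [1,4] first violates the dependency 2->4, and executing pack [2,3] first violates 1->3, so no pack schedule reproduces the scalar order.

```java
// Illustration only: the statements 1..4 from the example above, with aliasing made explicit.
public class CyclicPackSketch {
    static void scalar(int[] I, float[] F) {
        F[0] = I[0] + 0.33f;         // 1
        I[1] = (int) (11.0f * F[1]); // 2
        I[0] = (int) (11.0f * F[0]); // 3
        F[1] = I[1] + 0.33f;         // 4
    }

    static void packed(int[] I, float[] F) {
        // Pack [1,4] executed together, then pack [2,3].
        F[0] = I[0] + 0.33f;         // 1
        F[1] = I[1] + 0.33f;         // 4 runs before 2 has written I[1]
        I[1] = (int) (11.0f * F[1]); // 2
        I[0] = (int) (11.0f * F[0]); // 3
    }

    public static void main(String[] args) {
        int[] i1 = {5, 7};   float[] f1 = {1.5f, 2.5f};
        int[] i2 = {5, 7};   float[] f2 = {1.5f, 2.5f};
        scalar(i1, f1);
        packed(i2, f2);
        System.out.println(f1[1] + " vs " + f2[1]); // differ: roughly 27.33 vs 7.33
    }
}
```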
------------- PR: https://git.openjdk.org/jdk/pull/12350 From thartmann at openjdk.org Wed Mar 15 06:02:22 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 15 Mar 2023 06:02:22 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 13:22:20 GMT, Damon Fenacci wrote: >> It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. >> There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). >> >> This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache: >> * at `Compilation::emit_code_epilog` >> * when calling `CodeCache::commit` as flushing is done anyway when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to`. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation. Additionally this halves the number of flushes during adapters generation (lots of calls). >> * at `SharedRuntime::generate_i2c2i_adapters` as this is called with a temporary buffer and an ICache flush is not needed >> >> This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3569 to 2028 on C1 (43.2% improvement) and from 3572 to 1952 on C2 (45.4% improvement). >> >> This fix includes changes for x86_32/64 and aarch64, which I could test thoroughly but also for **arm** and **riscv**, for which I would need some help with testing. > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters FTR, the follow-up RFE is [JDK-8303971](https://bugs.openjdk.org/browse/JDK-8303971). ------------- PR: https://git.openjdk.org/jdk/pull/12877 From bulasevich at openjdk.org Wed Mar 15 07:10:37 2023 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 15 Mar 2023 07:10:37 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v4] In-Reply-To: References: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> Message-ID: On Fri, 10 Mar 2023 16:08:09 GMT, Jasmine K. wrote: >> Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: >> >> Update full name > > Thanks for the reviews! @SuperCoder7979 I see a new issue, it must be related to this change. Can you take a look? 
https://bugs.openjdk.org/browse/JDK-8304230 ------------- PR: https://git.openjdk.org/jdk/pull/12734 From duke at openjdk.org Wed Mar 15 07:17:22 2023 From: duke at openjdk.org (Damon Fenacci) Date: Wed, 15 Mar 2023 07:17:22 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: <09ChBmfOZfAw7M29PpuQ6tsb_Oezm55bA5an7I4A_mk=.f740db25-ad44-45ad-a45e-685086f801f8@github.com> On Tue, 14 Mar 2023 14:47:34 GMT, Damon Fenacci wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters > > This fix includes some changes to `sharedRuntime` for multiple platforms. I could thoroughly test on x86_32/64 and aarch64 but I'd need some help testing on **ARM** and **RISCV**. > @RealFYang could you or someone you know please run some testing for **RISCV** to make sure my change doesn't break anything? @bulasevich @shqking could I ask you the same for **ARM**? > Thank you very much in advance for your help! > @dafedafe as followup RFE (I don't want to add more changes to this PR) look on all uses of `masm::flush()` for **temporary** buffers. I see such uses in some other `SharedRuntime` methods: [SharedRuntime::generate_resolve_blob](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#L3299) > There are may be only few but since we start fixing such cases we should finish. Sure! Thanks for the hint @vnkozlov. ------------- PR: https://git.openjdk.org/jdk/pull/12877 From thartmann at openjdk.org Wed Mar 15 07:32:20 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 15 Mar 2023 07:32:20 GMT Subject: RFR: 8293324: ciField.hpp has two methods to return field's offset In-Reply-To: References: Message-ID: <3BgeRE4TpMGh9D2Xpq8r4fq5IopiEX2jKmOLfj6Xcb8=.2db75c23-f9e6-49ac-b1db-327c9bc40cdd@github.com> On Mon, 13 Mar 2023 16:52:30 GMT, Ilya Korennoy wrote: > Small refactoring of ciField.hpp method `offset()` removed and `offset_in_bytes()` used instead. > > Test: tier1 linux-x86_64 Looks good to me. Please enable https://openjdk.org/guide/#github-actions for some automated testing. And please update the copyright dates of all files. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/13003Changes requested by thartmann (Reviewer). From tobias.hartmann at oracle.com Wed Mar 15 07:49:55 2023 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 15 Mar 2023 08:49:55 +0100 Subject: test/lib-test/jdk/test/whitebox/CPUInfoTest.java fails on Intel Alder/Raptor Lake In-Reply-To: References: Message-ID: Hi Kosta, Welcome to OpenJDK and thanks for reporting this issue! I was able to reproduce it and filed: https://bugs.openjdk.org/browse/JDK-8304242 Just a minor correction: Support was not added by JDK-8284161 but JDK-8264543. Do you intend to work on the fix? Best regards, Tobias On 14.03.23 16:59, Kosta Stojiljkovic wrote: > Dear all, > > On a machine with the 13th gen Intel CPU, WhiteBox test in the file CPUInfoTest.java fails. 
> > The test in question checks the features returned from the CPUInfo class against a hardcoded set of > well known CPU features inside the test, that looks like this: > > ?wellKnownCPUFeatures = Set.of( "cx8", "cmov", "fxsr", "ht", "mmx", "3dnowpref", "sse", "sse2", > "sse3", "ssse3", "sse4a", "sse4.1", "sse4.2", "popcnt", "lzcnt", "tsc", "tscinvbit", "tscinv", > "avx", "avx2", "aes", "erms", "clmul", "bmi1", "bmi2", "rtm", "adx", "avx512f", "avx512dq", > "avx512pf", "avx512er", "avx512cd", "avx512bw", "avx512vl", "sha", "fma", "vzeroupper", > "avx512_vpopcntdq", "avx512_vpclmulqdq", "avx512_vaes", "avx512_vnni", "clflush", "clflushopt", > "clwb", "avx512_vbmi2", "avx512_vbmi", "rdtscp", "rdpid", "hv", "fsrm", "avx512_bitalg", "gfni", > "f16c", "pku", "ospke", "cet_ibt", "cet_ss", "avx512_ifma"); > > This set of strings on the other hand does not account for the SERIALIZE instruction, added in the > 12th generation of Intel Core processors (codenamed Alder Lake), while the processor inspection > implementation in /src/hotspot/cpu/x86/vm_version_x86.cpp picks up the flag for it, thus leading to > a discrepancy between the features set in the test and the features string obtained from CPUInfo > class, when ran on the 12th gen processors and higher. > > The support for this feature seems to have been added to the code base with the following commit: > 8284161: Implementation of Virtual Thread > , but the authors > may have missed adding the "serialize" string to the set of well known CPU features in the > CPUInfoTest.java file. > > I would like to extend the wellKnownCPUFeatures set with the "serialize" keyword, unless there is a > reason that this keyword is missing that I do not see? > > If that isn't the case, I would appreciate getting some support with creating an issue in JBS, since > I am not an author yet :) > > I look forward to your feedback! > > Best Regards, > Kosta Stojiljkovic From epeter at openjdk.org Wed Mar 15 08:15:36 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 15 Mar 2023 08:15:36 GMT Subject: RFR: 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs [v26] In-Reply-To: References: <_ygbSM3ARi21Y5jfOqEIYEqZe84cwTbQ6vayG9hVYiY=.36ace32c-73ac-452e-a9ac-f9ffaea08931@github.com> <6-Ea0WdYmM96b6eqsNReqIKevZPZsZkWEOSoUyycJpo=.11b237f7-35ca-4519-a605-5a541719c5cb@github.com> Message-ID: On Wed, 15 Mar 2023 04:50:48 GMT, Vladimir Kozlov wrote: >> https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 >> >>> So is there a **4th Bug** lurking here? >> >> @vnkozlov I now found an example that reveals this **Bug 4**. I want to fix it in a separate Bug [JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042). >> >> This is the method: >> >> static void test(int[] dataI1, int[] dataI2, float[] dataF1, float[] dataF2) { >> for (int i = 0; i < RANGE/2; i+=2) { >> dataF1[i+0] = dataI1[i+0] + 0.33f; // 1 >> dataI2[i+1] = (int)(11.0f * dataF2[i+1]); // 2 >> >> dataI2[i+0] = (int)(11.0f * dataF2[i+0]); // 3 >> dataF1[i+1] = dataI1[i+1] + 0.33f; // 4 >> } >> } >> >> >> Note: `dataI1 == dataI2` and `dataF1 == dataF2`. I only had to use two references so that C2 does not know this, and does not optimize away load after store. >> >> Lines 1 and 4 are `isomorphic` and `independent`. The same holds for line 2 and 3. We creates the packs `[1,4]` and `[2,3]`, and vectorize (with and without my patch). However, we have the following dependencies: `1->3` and `2->4`. This creates a cyclic dependency between the two packs. 
>> >> As explained in the previous https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129, we have to verify that there are no cyclic dependencies between the packs, just before we schedule. The SuperWord paper states this in "3.7 Scheduling". > > @eme64, I ran with `-XX:-TieredCompilation -Xbatch -XX:CICompilerCount=1 -XX:+TraceNewVectors` on AVX512 linux machine our vectors jtregs tests (including `jdk/incubator/vector) and everything is fine except one test: > > compiler/loopopts/superword/TestPickLastMemoryState.java` > > With these changes we almost don't generate vectors (may be 2 per `@run`). Without changes we got about 160 (50 per `@run`) new vectors. It has several `@run` commands for different vector sizes. @vnkozlov I'm looking into `TestPickLastMemoryState.java` These are relevant vectorized methods: - `f`: vectorize with with master, nothing with my patch. - `reset`: both vectorize. - `test1 - 6`: vectorize with master, nothing with patch. Explanation: - `f`: `b[h - 1]` and `b[h]` misaligned. - `reset`: just a simple `VectorStore` with all zeros. Not much can go wrong here. - `test1`: `iArr[i1]` and `iArr[i1+1]` misaligned. - `test2`: `iArr[i1]` and `iArr[i1+2]` misaligned. - `test3`: `iArr[i1]` and `iArr[i1-2]` misaligned. - `test4`: `iArr[i1]` and `iArr[i1-3]` misaligned. - `test5`: `iArr[i1]` and `iArr[i1+2]` misaligned. - `test6`: `iArr[i1]` and `iArr[i1+1]` misaligned. My first guess is that it means that we reject the misaligned slice during `find_adjacent_refs`. But then we re-introduce (some of) the memops during `extend_packlist`, because we do not only follow non-memops, but also memops. I call this "happy accident" before my fix, which should not be allowed. It can indeed lead to bugs on very similar examples. I now disallow these "happy accidents", I forbid the re-introduction of memops during `extend_packlist`. TODO: I will look at all examples in detail now, and see if this guess is correct. If this is correct, we should just use the `_do_vector_loop` in general, which would probably allow most if not all of these cases. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From kostasto at proton.me Wed Mar 15 08:19:47 2023 From: kostasto at proton.me (Kosta Stojiljkovic) Date: Wed, 15 Mar 2023 08:19:47 +0000 Subject: test/lib-test/jdk/test/whitebox/CPUInfoTest.java fails on Intel Alder/Raptor Lake In-Reply-To: References: Message-ID: Hi Tobias, Thank you for the correction and for creating the issue in the bug tracker. I do intend to work on the fix. Best, Kosta ------- Original Message ------- On Wednesday, March 15th, 2023 at 8:49 AM, Tobias Hartmann wrote: > Hi Kosta, > > Welcome to OpenJDK and thanks for reporting this issue! I was able to reproduce it and filed: > https://bugs.openjdk.org/browse/JDK-8304242 > > Just a minor correction: Support was not added by JDK-8284161 but JDK-8264543. > > Do you intend to work on the fix? > > Best regards, > Tobias > > On 14.03.23 16:59, Kosta Stojiljkovic wrote: > > > Dear all, > > > > On a machine with the 13th gen Intel CPU, WhiteBox test in the file CPUInfoTest.java fails. 
> > > > The test in question checks the features returned from the CPUInfo class against a hardcoded set of > > well known CPU features inside the test, that looks like this: > > > > wellKnownCPUFeatures = Set.of( "cx8", "cmov", "fxsr", "ht", "mmx", "3dnowpref", "sse", "sse2", > > "sse3", "ssse3", "sse4a", "sse4.1", "sse4.2", "popcnt", "lzcnt", "tsc", "tscinvbit", "tscinv", > > "avx", "avx2", "aes", "erms", "clmul", "bmi1", "bmi2", "rtm", "adx", "avx512f", "avx512dq", > > "avx512pf", "avx512er", "avx512cd", "avx512bw", "avx512vl", "sha", "fma", "vzeroupper", > > "avx512_vpopcntdq", "avx512_vpclmulqdq", "avx512_vaes", "avx512_vnni", "clflush", "clflushopt", > > "clwb", "avx512_vbmi2", "avx512_vbmi", "rdtscp", "rdpid", "hv", "fsrm", "avx512_bitalg", "gfni", > > "f16c", "pku", "ospke", "cet_ibt", "cet_ss", "avx512_ifma"); > > > > This set of strings on the other hand does not account for the SERIALIZE instruction, added in the > > 12th generation of Intel Core processors (codenamed Alder Lake), while the processor inspection > > implementation in /src/hotspot/cpu/x86/vm_version_x86.cpp picks up the flag for it, thus leading to > > a discrepancy between the features set in the test and the features string obtained from CPUInfo > > class, when ran on the 12th gen processors and higher. > > > > The support for this feature seems to have been added to the code base with the following commit: > > 8284161: Implementation of Virtual Thread > > https://github.com/openjdk/jdk/commit/9583e3657e43cc1c6f2101a64534564db2a9bd84, but the authors > > may have missed adding the "serialize" string to the set of well known CPU features in the > > CPUInfoTest.java file. > > > > I would like to extend the wellKnownCPUFeatures set with the "serialize" keyword, unless there is a > > reason that this keyword is missing that I do not see? > > > > If that isn't the case, I would appreciate getting some support with creating an issue in JBS, since > > I am not an author yet :) > > > > I look forward to your feedback! > > > > Best Regards, > > Kosta Stojiljkovic From mdoerr at openjdk.org Wed Mar 15 09:02:23 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 15 Mar 2023 09:02:23 GMT Subject: RFR: 8299375: [PPC64] GetStackTraceSuspendedStressTest tries to deoptimize frame with invalid fp In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 10:46:05 GMT, Richard Reingruber wrote: > Mark a frame as not fully initialized when copying it from a continuation StackChunk to the stack until the callers_sp (aka back link) is set. > > This avoids the assertion given in the bug report when the copied frame is deoptimized before it is fully initialized. > IMHO the deoptimization at that point is a little questionable but it actually only changes the pc of the frame which can be done. > Note that the frame can get extended later (and metadata can get overridden) but [there is code that handles this](https://github.com/openjdk/jdk/blob/34a92466a615415b76c8cb6010ff7e6e1a1d63b4/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L2108-L2110). > > Testing: jdk_loom. The fix passed our CI testing. This includes most JCK and JTREG tiers 1-4, also in Xcomp mode, on the standard platforms and also on ppc64le. This looks good. src/hotspot/cpu/ppc/frame_ppc.hpp line 389: > 387: void set_offset_fp(int value) { assert_on_heap(); _offset_fp = value; } > 388: > 389: // Mark a frame as not fully initialized Maybe add a comment like "Must not be used for frames in the valid back chain."? 
------------- PR: https://git.openjdk.org/jdk/pull/12941 From aph at openjdk.org Wed Mar 15 09:07:23 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 15 Mar 2023 09:07:23 GMT Subject: RFR: 8302906: AArch64: Add SVE backend support for vector unsigned comparison [v5] In-Reply-To: <0N4QegqcD5I5funLRjC7WtivwQkmWgAlbI38Wcy2k8I=.c8fb8cff-4dcd-4f8c-a582-5ae1903773c8@github.com> References: <0N4QegqcD5I5funLRjC7WtivwQkmWgAlbI38Wcy2k8I=.c8fb8cff-4dcd-4f8c-a582-5ae1903773c8@github.com> Message-ID: On Wed, 15 Mar 2023 03:48:01 GMT, changpeng1997 wrote: >> This patch implements unsigned vector comparison on SVE. >> >> 1: Test: >> All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. >> >> 2: Performance: >> (1): Benchmark: >> As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. for ByteVector unsigned comparison: >> >> >> @Benchmark >> public void byteVectorUnsignedCompare() { >> for (int j = 0; j < 200; j++) { >> for (int i = 0; i < bspecies.length(); i++) { >> ByteVector av = ByteVector.fromArray(bspecies, ba, i); >> ByteVector ca = ByteVector.fromArray(bspecies, bb, i); >> av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); >> } >> } >> } >> >> >> (2): Performance data >> >> Before: >> >> >> Benchmark Score(op/ms) Error >> ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 >> ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 >> IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 >> LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 >> >> >> After: >> >> >> Benchmark Score(op/ms) Error >> ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 >> ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 >> IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 >> LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 >> >> >> [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector >> [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi >> [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 >> [4] https://bugs.openjdk.org/browse/JDK-8282850 >> [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae > > changpeng1997 has updated the pull request incrementally with one additional commit since the last revision: > > Move cm() and fcm() to advsimd-three-same section. That's a lot cleaner. Thanks. ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.org/jdk/pull/12725 From rrich at openjdk.org Wed Mar 15 09:17:56 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 15 Mar 2023 09:17:56 GMT Subject: RFR: 8299375: [PPC64] GetStackTraceSuspendedStressTest tries to deoptimize frame with invalid fp [v2] In-Reply-To: References: Message-ID: > Mark a frame as not fully initialized when copying it from a continuation StackChunk to the stack until the callers_sp (aka back link) is set. > > This avoids the assertion given in the bug report when the copied frame is deoptimized before it is fully initialized. > IMHO the deoptimization at that point is a little questionable but it actually only changes the pc of the frame which can be done. 
> Note that the frame can get extended later (and metadata can get overridden) but [there is code that handles this](https://github.com/openjdk/jdk/blob/34a92466a615415b76c8cb6010ff7e6e1a1d63b4/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L2108-L2110). > > Testing: jdk_loom. The fix passed our CI testing. This includes most JCK and JTREG tiers 1-4, also in Xcomp mode, on the standard platforms and also on ppc64le. Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Feedback Martin ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12941/files - new: https://git.openjdk.org/jdk/pull/12941/files/4188e822..1a283f73 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12941&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12941&range=00-01 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/12941.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12941/head:pull/12941 PR: https://git.openjdk.org/jdk/pull/12941 From rrich at openjdk.org Wed Mar 15 09:17:57 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 15 Mar 2023 09:17:57 GMT Subject: RFR: 8299375: [PPC64] GetStackTraceSuspendedStressTest tries to deoptimize frame with invalid fp [v2] In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 08:59:22 GMT, Martin Doerr wrote: >> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: >> >> Feedback Martin > > src/hotspot/cpu/ppc/frame_ppc.hpp line 389: > >> 387: void set_offset_fp(int value) { assert_on_heap(); _offset_fp = value; } >> 388: >> 389: // Mark a frame as not fully initialized > > Maybe add a comment like "Must not be used for frames in the valid back chain."? Thanks for the feedback. I've added the comment and updated the Copyright years. ------------- PR: https://git.openjdk.org/jdk/pull/12941 From duke at openjdk.org Wed Mar 15 09:18:21 2023 From: duke at openjdk.org (changpeng1997) Date: Wed, 15 Mar 2023 09:18:21 GMT Subject: RFR: 8302906: AArch64: Add SVE backend support for vector unsigned comparison [v5] In-Reply-To: References: <0N4QegqcD5I5funLRjC7WtivwQkmWgAlbI38Wcy2k8I=.c8fb8cff-4dcd-4f8c-a582-5ae1903773c8@github.com> Message-ID: <3tzoAfesdz8T2KEOq42-S5tl_xA1QOrGzt2Zr-QHRdU=.3ff21424-451a-495a-81bf-f91cd7ff266f@github.com> On Wed, 15 Mar 2023 09:04:55 GMT, Andrew Haley wrote: > That's a lot cleaner. Thanks. Thanks for your review. ------------- PR: https://git.openjdk.org/jdk/pull/12725 From mdoerr at openjdk.org Wed Mar 15 09:22:21 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 15 Mar 2023 09:22:21 GMT Subject: RFR: 8299375: [PPC64] GetStackTraceSuspendedStressTest tries to deoptimize frame with invalid fp [v2] In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 09:17:56 GMT, Richard Reingruber wrote: >> Mark a frame as not fully initialized when copying it from a continuation StackChunk to the stack until the callers_sp (aka back link) is set. >> >> This avoids the assertion given in the bug report when the copied frame is deoptimized before it is fully initialized. >> IMHO the deoptimization at that point is a little questionable but it actually only changes the pc of the frame which can be done. 
>> Note that the frame can get extended later (and metadata can get overridden) but [there is code that handles this](https://github.com/openjdk/jdk/blob/34a92466a615415b76c8cb6010ff7e6e1a1d63b4/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L2108-L2110). >> >> Testing: jdk_loom. The fix passed our CI testing. This includes most JCK and JTREG tiers 1-4, also in Xcomp mode, on the standard platforms and also on ppc64le. > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Feedback Martin Thanks! ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.org/jdk/pull/12941 From thartmann at openjdk.org Wed Mar 15 09:36:42 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 15 Mar 2023 09:36:42 GMT Subject: RFR: 8304034: Remove redundant and meaningless comments in opto [v5] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 05:34:20 GMT, Yi Yang wrote: >> Please help review this trivial change to remove redundant and meaningless comments in `hotspot/share/opto` directory. >> >> They are either >> 1. Repeat the function name that the function they comment for. >> 2. Makes no sense, e.g. `//----Idealize----` >> >> And I think original CC-style code (`if( test )`,`call( arg )`) can be formatted in one go, instead of formatting the near code when someone touches them. But this may form a big patch, and it confuses code blame, so I left this work until we reach a consensus. >> >> Thanks! > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > restore mistakenly removed lines Looks good to me otherwise. src/hotspot/share/opto/compile.cpp line 2499: > 2497: } > 2498: > 2499: //---------------------------- Bitwise operation packing optimization --------------------------- I think this should be converted to a "normal" comment. src/hotspot/share/opto/matcher.cpp line 1606: > 1604: > 1605: > 1606: //------------------------------Instruction Selection-------------------------- This should stay. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12995 From tobias.hartmann at oracle.com Wed Mar 15 09:40:09 2023 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 15 Mar 2023 10:40:09 +0100 Subject: test/lib-test/jdk/test/whitebox/CPUInfoTest.java fails on Intel Alder/Raptor Lake In-Reply-To: References: Message-ID: <747aecd5-9ab3-e28d-ae2d-be2e11b7c8ff@oracle.com> Great, just let me know if you need any help. Best regards, Tobias On 15.03.23 09:19, Kosta Stojiljkovic wrote: > Hi Tobias, > > Thank you for the correction and for creating the issue in the bug tracker. > > I do intend to work on the fix. > > Best, > Kosta > > ------- Original Message ------- > On Wednesday, March 15th, 2023 at 8:49 AM, Tobias Hartmann wrote: > > >> Hi Kosta, >> >> Welcome to OpenJDK and thanks for reporting this issue! I was able to reproduce it and filed: >> https://bugs.openjdk.org/browse/JDK-8304242 >> >> Just a minor correction: Support was not added by JDK-8284161 but JDK-8264543. >> >> Do you intend to work on the fix? >> >> Best regards, >> Tobias >> >> On 14.03.23 16:59, Kosta Stojiljkovic wrote: >> >>> Dear all, >>> >>> On a machine with the 13th gen Intel CPU, WhiteBox test in the file CPUInfoTest.java fails. 
>>> >>> The test in question checks the features returned from the CPUInfo class against a hardcoded set of >>> well known CPU features inside the test, that looks like this: >>> >>> wellKnownCPUFeatures = Set.of( "cx8", "cmov", "fxsr", "ht", "mmx", "3dnowpref", "sse", "sse2", >>> "sse3", "ssse3", "sse4a", "sse4.1", "sse4.2", "popcnt", "lzcnt", "tsc", "tscinvbit", "tscinv", >>> "avx", "avx2", "aes", "erms", "clmul", "bmi1", "bmi2", "rtm", "adx", "avx512f", "avx512dq", >>> "avx512pf", "avx512er", "avx512cd", "avx512bw", "avx512vl", "sha", "fma", "vzeroupper", >>> "avx512_vpopcntdq", "avx512_vpclmulqdq", "avx512_vaes", "avx512_vnni", "clflush", "clflushopt", >>> "clwb", "avx512_vbmi2", "avx512_vbmi", "rdtscp", "rdpid", "hv", "fsrm", "avx512_bitalg", "gfni", >>> "f16c", "pku", "ospke", "cet_ibt", "cet_ss", "avx512_ifma"); >>> >>> This set of strings on the other hand does not account for the SERIALIZE instruction, added in the >>> 12th generation of Intel Core processors (codenamed Alder Lake), while the processor inspection >>> implementation in /src/hotspot/cpu/x86/vm_version_x86.cpp picks up the flag for it, thus leading to >>> a discrepancy between the features set in the test and the features string obtained from CPUInfo >>> class, when ran on the 12th gen processors and higher. >>> >>> The support for this feature seems to have been added to the code base with the following commit: >>> 8284161: Implementation of Virtual Thread >>> https://urldefense.com/v3/__https://github.com/openjdk/jdk/commit/9583e3657e43cc1c6f2101a64534564db2a9bd84__;!!ACWV5N9M2RV99hQ!IJbzLNzlucZbrRztEqHetAhucew02Cs7jMUa87bypeJ-thvbz2OMTM2upkyX4r2_GvMBPFuFRDWZwFKEWlWsGdE$ , but the authors >>> may have missed adding the "serialize" string to the set of well known CPU features in the >>> CPUInfoTest.java file. >>> >>> I would like to extend the wellKnownCPUFeatures set with the "serialize" keyword, unless there is a >>> reason that this keyword is missing that I do not see? >>> >>> If that isn't the case, I would appreciate getting some support with creating an issue in JBS, since >>> I am not an author yet :) >>> >>> I look forward to your feedback! >>> >>> Best Regards, >>> Kosta Stojiljkovic From wanghaomin at openjdk.org Wed Mar 15 10:04:25 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 15 Mar 2023 10:04:25 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> Message-ID: On Thu, 9 Mar 2023 01:19:40 GMT, Wang Haomin wrote: >> After https://bugs.openjdk.org/browse/JDK-8292289 , the base class of VectorTestNode changed from Node to CmpNode. So I add two match rule into ad file. >> >> match(If cop (VectorTest op1 op2)); >> match(Set dst (CMoveI (Binary cop (VectorTest op1 op2)) (Binary src1 src2))); >> >> First error, rule1 shouldn't generate the statement "node->_bottom_type = _leaf->bottom_type();". >> Second error, both rule1 and rule2 need to use VectorTestNode, the VectorTestNode should be cloned like CmpNode. > > Wang Haomin has updated the pull request incrementally with one additional commit since the last revision: > > compare the results with 0 Would anyone review this, thanks. 
------------- PR: https://git.openjdk.org/jdk/pull/12917 From duke at openjdk.org Wed Mar 15 10:40:07 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Wed, 15 Mar 2023 10:40:07 GMT Subject: RFR: 8293324: ciField.hpp has two methods to return field's offset [v2] In-Reply-To: References: Message-ID: > Small refactoring of ciField.hpp method `offset()` removed and `offset_in_bytes()` used instead. > > Test: tier1 linux-x86_64 Ilya Korennoy has updated the pull request incrementally with one additional commit since the last revision: 8293324: Update the copyright dates ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13003/files - new: https://git.openjdk.org/jdk/pull/13003/files/3d94cab0..29e36aa9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13003&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13003&range=00-01 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13003.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13003/head:pull/13003 PR: https://git.openjdk.org/jdk/pull/13003 From yyang at openjdk.org Wed Mar 15 10:45:08 2023 From: yyang at openjdk.org (Yi Yang) Date: Wed, 15 Mar 2023 10:45:08 GMT Subject: RFR: 8304049: C2 can not merge trivial Ifs due to CastII Message-ID: Hi can I have a review for this patch? C2 can not apply Split If for the attached trivial case. PhiNode::Ideal removes itself by unique_input but introduces a new CastII, therefore we have two Cmp, which is not identical for split_if. public static void test5(int a, int b){ if( b!=0) { int_field = 35; } else { int_field =222; } if( b!=0) { int_field = 35; } else { int_field =222; } } Test: tier1, application/ctw/modules ------------- Commit messages: - 8304049: C2 can not merge trivial Ifs due to CastII Changes: https://git.openjdk.org/jdk/pull/13039/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13039&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304049 Stats: 53 lines in 7 files changed: 50 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/13039.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13039/head:pull/13039 PR: https://git.openjdk.org/jdk/pull/13039 From roland at openjdk.org Wed Mar 15 12:16:20 2023 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 15 Mar 2023 12:16:20 GMT Subject: RFR: 8304049: C2 can not merge trivial Ifs due to CastII In-Reply-To: References: Message-ID: <8O1VwQaD922SM3z2vFSf7aEHUHJ1OT5tESCfphmuR5M=.7f85c81d-c679-4f27-9334-3f2749772e06@github.com> On Wed, 15 Mar 2023 10:37:03 GMT, Yi Yang wrote: > Hi can I have a review for this patch? C2 can not apply Split If for the attached trivial case. PhiNode::Ideal removes itself by unique_input but introduces a new CastII > > https://github.com/openjdk/jdk/blob/e3777b0c49abb9cc1925f4044392afadf3adef61/src/hotspot/share/opto/cfgnode.cpp#L1470-L1474 > > https://github.com/openjdk/jdk/blob/e3777b0c49abb9cc1925f4044392afadf3adef61/src/hotspot/share/opto/cfgnode.cpp#L2078-L2079 > > Therefore we have two Cmp, which is not identical for split_if. > > ![image](https://user-images.githubusercontent.com/5010047/225285449-b41dc939-1d3f-45f3-b6d6-a9b9445c2f6a.png) > (Fig1. Phi#41 is removed during ideal, create CastII#58 then) > > ![image](https://user-images.githubusercontent.com/5010047/225285493-30471f1c-97b0-452b-9218-3b5f09f09859.png) > (Fig2. CmpI#42 and CmpI#23 are different comparisons, they are not identical_backtoback_ifs ) > > This patch adds Cmp identity to find existing Cmp node, i.e. 
Cmp#42 is identity to Cmp#23 > > > public static void test5(int a, int b){ > > if( b!=0) { > int_field = 35; > } else { > int_field =222; > } > > if( b!=0) { > int_field = 35; > } else { > int_field =222; > } > } > > > > Test: tier1, application/ctw/modules Why not do this (and the other similar #12978) by extending the logic where the merge happens in `PhaseIdealLoop::identical_backtoback_ifs()`? I see a few reasons why that would be preferable: - if the motivation is only to merge ifs, is it a good thing to transform the graph blindly during GVN when we don't know if the ifs are even candidates for merging? - having that code far away from where the merge transformation happens makes understanding what's going on harder - dropping a Cast node is risky, in general, because they sometimes carry a dependence that's required for correctness. So it should be done with care. ------------- PR: https://git.openjdk.org/jdk/pull/13039 From thartmann at openjdk.org Wed Mar 15 12:21:21 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 15 Mar 2023 12:21:21 GMT Subject: RFR: 8293324: ciField.hpp has two methods to return field's offset [v2] In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 10:40:07 GMT, Ilya Korennoy wrote: >> Small refactoring of ciField.hpp method `offset()` removed and `offset_in_bytes()` used instead. >> >> Test: tier1 linux-x86_64 > > Ilya Korennoy has updated the pull request incrementally with one additional commit since the last revision: > > 8293324: Update the copyright dates Marked as reviewed by thartmann (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/13003 From duke at openjdk.org Wed Mar 15 12:27:22 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Wed, 15 Mar 2023 12:27:22 GMT Subject: RFR: 8293324: ciField.hpp has two methods to return field's offset [v2] In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 12:18:49 GMT, Tobias Hartmann wrote: >> Ilya Korennoy has updated the pull request incrementally with one additional commit since the last revision: >> >> 8293324: Update the copyright dates > > Marked as reviewed by thartmann (Reviewer). @TobiHartmann - thanks for the review! ------------- PR: https://git.openjdk.org/jdk/pull/13003 From duke at openjdk.org Wed Mar 15 12:40:39 2023 From: duke at openjdk.org (Jasmine K.) Date: Wed, 15 Mar 2023 12:40:39 GMT Subject: RFR: 8303238: Create generalizations for existing LShift ideal transforms [v4] In-Reply-To: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> References: <22Lk8Rj9OTcddJxDx3-jIAIVXRnEFOwyScgxFTgXePA=.beea5fd7-f2d4-4b39-b745-ec7ab5c77b6b@github.com> Message-ID: <_QHLJXmrNrGgy90UGF8rArNUpdEQftjxoZxWDLvpUfA=.95db1ffd-1d1f-4f55-9374-86d8c100d643@github.com> On Fri, 10 Mar 2023 01:10:03 GMT, Jasmine K. wrote: >> Hello, >> I would like to generalize two ideal transforms for bitwise shifts. Left shift nodes perform the transformations `(x >> C1) << C2 => x & (-1 << C2)` and `((x >> C1) & Y) << C2 => x & (Y << C2)`, but only when the case where `C1 == C2`. However, it is possible to use both of these rules to improve cases where the constants aren't equal, by removing one of the shifts and replacing it with a bitwise and. This transformation is profitable because typically more bitwise ands can be dispatched per cycle than bit shifts. In addition, the strength reduction from a shift to a bitwise and can allow more profitable transformations to occur. 
These patterns are found throughout the JDK, mainly around strings and OW2 ASM. I've attached some profiling results from my (Zen 2) machine below: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> LShiftNodeIdealize.testRgbaToAbgr avgt 15 63.287 ? 1.770 ns/op / 54.199 ? 1.408 ns/op + 14.36% >> LShiftNodeIdealize.testShiftAndInt avgt 15 874.564 ? 15.334 ns/op / 538.408 ? 11.768 ns/op + 38.44% >> LShiftNodeIdealize.testShiftAndLong avgt 15 1017.466 ? 29.010 ns/op / 701.356 ? 18.258 ns/op + 31.07% >> LShiftNodeIdealize.testShiftInt avgt 15 663.865 ? 14.226 ns/op / 533.588 ? 9.949 ns/op + 19.63% >> LShiftNodeIdealize.testShiftInt2 avgt 15 658.976 ? 32.856 ns/op / 649.871 ? 10.598 ns/op + 1.38% >> LShiftNodeIdealize.testShiftLong avgt 15 815.540 ? 14.721 ns/op / 689.270 ? 14.028 ns/op + 15.48% >> LShiftNodeIdealize.testShiftLong2 avgt 15 817.936 ? 23.573 ns/op / 810.185 ? 14.983 ns/op + 0.95% >> >> >> In addition, in the process of making this PR I've found a missing ideal transform for `RShiftLNode`, so right shifts of large numbers (such as `x >> 65`) are not properly folded down, like how they are `RShiftINode` and `URShiftLNode`. I'll address this in a future RFR. >> >> Testing: GHA, tier1 local, and performance testing >> >> Thanks, >> Jasmine K > > Jasmine K. has updated the pull request incrementally with one additional commit since the last revision: > > Update full name Hi, and apologies- I'll address this ASAP. Thanks for the heads up. ------------- PR: https://git.openjdk.org/jdk/pull/12734 From yyang at openjdk.org Wed Mar 15 12:48:20 2023 From: yyang at openjdk.org (Yi Yang) Date: Wed, 15 Mar 2023 12:48:20 GMT Subject: RFR: 8304049: C2 can not merge trivial Ifs due to CastII In-Reply-To: <8O1VwQaD922SM3z2vFSf7aEHUHJ1OT5tESCfphmuR5M=.7f85c81d-c679-4f27-9334-3f2749772e06@github.com> References: <8O1VwQaD922SM3z2vFSf7aEHUHJ1OT5tESCfphmuR5M=.7f85c81d-c679-4f27-9334-3f2749772e06@github.com> Message-ID: On Wed, 15 Mar 2023 12:13:21 GMT, Roland Westrelin wrote: > Why not do this (and the other similar #12978) by extending the logic where the merge happens in `PhaseIdealLoop::identical_backtoback_ifs()`? > > I see a few reasons why that would be preferable: > > * if the motivation is only to merge ifs, is it a good thing to transform the graph blindly during GVN when we don't know if the ifs are even candidates for merging? > * having that code far away from where the merge transformation happens makes understanding what's going on harder > * dropping a Cast node is risky, in general, because they sometimes carry a dependence that's required for correctness. So it should be done with care. Hi @rwestrel, extending identical_backtoback_ifs is another candidate, but these IR shapes(this and #12978, ) are common cases(tty tells me in newly added Identity): - Bool (Cmp A B) and Bool (Cmp B A) - Bool (Cmp (Cast P1) P2) and Bool (Cmp P1 P2) They affect split_if but they are not related to split_if. Intuitively, Bool (Cmp A B) and Bool (Cmp B A) are identical shapes and could be transformed locally. ------------- PR: https://git.openjdk.org/jdk/pull/13039 From qamai at openjdk.org Wed Mar 15 13:04:12 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 15 Mar 2023 13:04:12 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float Message-ID: Hi, This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. 
With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. Please take a look and leave some reviews. Thanks a lot. ------------- Commit messages: - improve rearrangeI Changes: https://git.openjdk.org/jdk/pull/13042/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13042&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304258 Stats: 28 lines in 3 files changed: 21 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/13042.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13042/head:pull/13042 PR: https://git.openjdk.org/jdk/pull/13042 From kvn at openjdk.org Wed Mar 15 13:35:46 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 15 Mar 2023 13:35:46 GMT Subject: RFR: 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs [v26] In-Reply-To: References: <_ygbSM3ARi21Y5jfOqEIYEqZe84cwTbQ6vayG9hVYiY=.36ace32c-73ac-452e-a9ac-f9ffaea08931@github.com> <6-Ea0WdYmM96b6eqsNReqIKevZPZsZkWEOSoUyycJpo=.11b237f7-35ca-4519-a605-5a541719c5cb@github.com> Message-ID: On Wed, 15 Mar 2023 08:11:56 GMT, Emanuel Peter wrote: > I call this "happy accident" before my fix, which should not be allowed. It can indeed lead to bugs on very similar examples. I now disallow these "happy accidents", I forbid the re-introduction of memops during extend_packlist. The test was added by [8290910](https://bugs.openjdk.org/browse/JDK-8290910) fix in JDK 20. Before that superword produces wrong results in these tests. The issue was found with fuzzer testing. I don't know where the test in [8293216](https://bugs.openjdk.org/browse/JDK-8293216) comes from. So you are right about this be corner case. Based on this I agree with not allowing vectorization in such cases for now. But file RFE to look on these cases much **later**. If vectorization produces valid result we should allow it. I understand that we are missing more precise checks which separate valid from invalid misaligned operations. I am not suggesting adding back code which extend memory ops without any checks but may be improve find_adjacent_refs when we can accept such cases. It is very complex case and fix could be also complex. ------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Wed Mar 15 14:06:53 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 15 Mar 2023 14:06:53 GMT Subject: Integrated: 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs In-Reply-To: References: Message-ID: On Tue, 31 Jan 2023 18:26:52 GMT, Emanuel Peter wrote: > **List of important things below** > > - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 > - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 > - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord > - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 > - Explanation of my new `find_dependency`. Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 > - I found that 4rth bug, where we have independent packs, but the packs have a cyclic dependency. I will fix it in a separate Bug. 
https://github.com/openjdk/jdk/pull/12350#issuecomment-1465860465 > > **Original RFE description:** > Cyclic dependencies are not handled correctly in all cases. Three examples: > > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 > > And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 > > And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: > https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 > > All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. > > **Analysis** > > The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: > > - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. > - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). > > Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. > > I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. > **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! > > **Solution** > > First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. 
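To make the single-BFS check concrete, here is a small illustrative sketch in Java rather than the actual C++ `SuperWord::find_dependence` code; the node and graph types below are placeholders invented for the example. The worklist is seeded with every member of a pack, the traversal follows dependence predecessors inside the block, and the pack is rejected as soon as another member is reached:

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Queue;
    import java.util.Set;

    final class PackIndependenceSketch {
        // Returns true if no member of the pack transitively depends on another member.
        static boolean isIndependent(List<Integer> pack, Map<Integer, List<Integer>> deps) {
            Set<Integer> members = new HashSet<>(pack);        // nodes of the candidate pack
            Set<Integer> visited = new HashSet<>(pack);
            Queue<Integer> worklist = new ArrayDeque<>(pack);  // one BFS seeded with all members
            while (!worklist.isEmpty()) {
                int n = worklist.poll();
                for (int pred : deps.getOrDefault(n, List.of())) {  // dependence predecessors
                    if (members.contains(pred)) {
                        return false;  // a pack member is reachable from a member: not independent
                    }
                    if (visited.add(pred)) {
                        worklist.add(pred);
                    }
                }
            }
            return true;
        }
    }

Because all members are seeded at once, a single traversal per pack is enough, which matches the cost argument above.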
> > I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. > > Another change I have made: > Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. > > **Testing** > > I added a few more regression tests, and am running tier1-3, plus some stress testing. > > However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. > **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. > > **Discussion / Future Work** > > I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) This pull request has now been integrated. Changeset: 01e69205 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/01e6920581407bc3bd69db495fc694629ef01262 Stats: 12924 lines in 7 files changed: 12863 ins; 46 del; 15 mod 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs Reviewed-by: kvn, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/12350 From epeter at openjdk.org Wed Mar 15 14:06:51 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 15 Mar 2023 14:06:51 GMT Subject: RFR: 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs [v27] In-Reply-To: <7BaWanF45N91xNBIIXvX0yXhxEAPFfKK18G36oWF6LI=.16fb8405-9e5e-4789-8671-8f2e465cdde5@github.com> References: <7BaWanF45N91xNBIIXvX0yXhxEAPFfKK18G36oWF6LI=.16fb8405-9e5e-4789-8671-8f2e465cdde5@github.com> Message-ID: On Mon, 13 Mar 2023 10:37:32 GMT, Emanuel Peter wrote: >> **List of important things below** >> >> - 3 Bugs I fixed + regression tests https://github.com/openjdk/jdk/pull/12350#issuecomment-1460323523 >> - Conversation with @jatin-bhateja about script-generated regression test: https://github.com/openjdk/jdk/pull/12350#discussion_r1115317152 >> - My [blog-article](https://eme64.github.io/blog/2023/02/23/SuperWord-Introduction.html) about SuperWord >> - Explanation of `dependency_graph` and `DepPreds` https://github.com/openjdk/jdk/pull/12350#issuecomment-1461498252 >> - Explanation of my new `find_dependency`. Arguments about `independence` of packs and cyclic dependencies between packs https://github.com/openjdk/jdk/pull/12350#issuecomment-1461681129 >> - I found that 4rth bug, where we have independent packs, but the packs have a cyclic dependency. I will fix it in a separate Bug. https://github.com/openjdk/jdk/pull/12350#issuecomment-1465860465 >> >> **Original RFE description:** >> Cyclic dependencies are not handled correctly in all cases. 
Three examples: >> >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277 >> >> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180 >> >> And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled: >> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73 >> >> All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values. >> >> **Analysis** >> >> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways: >> >> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks. >> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). >> >> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code. >> >> I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd. >> **Update**: I found a **Test.java** that lead to a crash (`SIGBUS`) on a ARM32 on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me! >> >> **Solution** >> >> First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution. >> >> I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. 
If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent. >> >> Another change I have made: >> Disallow `extend_packlist` from adding `MemNodes` back in. Because if we have rejected some memops, we do not want them to be added back in later. >> >> **Testing** >> >> I added a few more regression tests, and am running tier1-3, plus some stress testing. >> >> However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machine, and some reviews. >> **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC. >> >> **Discussion / Future Work** >> >> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 41 commits: > > - merge master: resolved conflict in test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java > - Merge master after NULL -> nullptr conversion > - Fixed wording from last commit > - A little renaming and improved comments > - resolve merge conflict after Roland's fix > - TestDependencyOffsets.java: add vanilla run > - TestDependencyOffsets.java: parallelize it + various AVX settings > - TestOptionVectorizeIR.java: removed PopulateIndex IR rule - fails on x86 32bit - see Matcher::match_rule_supported > - Merge branch 'master' into JDK-8298935 > - Reworked TestDependencyOffsets.java > - ... and 31 more: https://git.openjdk.org/jdk/compare/25e7ac22...ff0850e6 Ok, great. Thanks everybody for the help! @vnkozlov @jatin-bhateja thanks for the reviews! @fg1417 @reinrich thanks for the testing and feedback! @TobiHartmann thanks for the review suggestions! I have the follwing follow-up RFE's: [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) [SuperWord] investigate if enabling _do_vector_loop by default creates speedup I want to see if we can vectorize more, using the `Vectorize` approach: given `-AlignVector`, do no alignment checks, create packs optimistically. Filter out packs that are not `independent` later. This will remedy lots of the "collateral damage" of this Bug-fix here. [JDK-8260943](https://bugs.openjdk.org/browse/JDK-8260943) Revisit vectorization optimization added by 8076284 This is an old one. But we should either delete the dead code that is not hard-coded to be `false`, or fix it. This is related to `_do_vector_loop` (JDK-8076284 first introduced it). [JDK-8303827](https://bugs.openjdk.org/browse/JDK-8303827) C2 SuperWord: allow more fine grained alignment for +AlignVector We should fix the "collateral damage" for the `+AlignVector` case. We can do that by relaxing the strict alignment requirement a bit, to 4/8-byte. The `vector_width` alignment was required on SPARC. But even there, the vectors were not longer than 8 bytes (@vnkozlov ). Some collateral damage will happen, for example some conversions will not be vectorized after this fix here. 
That will for example affect `TestVectorizeTypeConversion.java` on some platforms with `+AlignVector`. **But**: I will not do this, and probably nobody from Oracle, as all our machines have `-AlignVector`. If anybody is interested in fixing this, you are free to take over the bug! [JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042) C2 SuperWord: schedule must remove packs with cyclic dependencies The **Bug 4** mentioned above, where we have cyclic dependencies on the `packs`, even when all `packs` are `independent`. Thanks again to all the involved! ------------- PR: https://git.openjdk.org/jdk/pull/12350 From roland at openjdk.org Wed Mar 15 15:56:49 2023 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 15 Mar 2023 15:56:49 GMT Subject: RFR: 8304049: C2 can not merge trivial Ifs due to CastII In-Reply-To: References: <8O1VwQaD922SM3z2vFSf7aEHUHJ1OT5tESCfphmuR5M=.7f85c81d-c679-4f27-9334-3f2749772e06@github.com> Message-ID: On Wed, 15 Mar 2023 12:43:59 GMT, Yi Yang wrote: > For carry_dependency, please let me do more investigation later to make sure they can be safely removed. `carry_dependency` true or not, removing a Cast node must be done with care as it removes a dependency. There have been numerous bugs over the year when dependencies were simply dropped because they got in the way of some other transformation. ------------- PR: https://git.openjdk.org/jdk/pull/13039 From jlu at openjdk.org Wed Mar 15 16:08:03 2023 From: jlu at openjdk.org (Justin Lu) Date: Wed, 15 Mar 2023 16:08:03 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native Message-ID: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. ------------- Commit messages: - Write to ASCII - Read in .properties as UTF-8, but write to LRB .java as ISO-8859-1 - Compile class with ascii (Not ready to make system wide change) - Toggle UTF-8 for javac option in JavaCompilation.gmk - CompileProperties converts in UTF-8 - Convert .properties from ISO-8859-1 to UTF-8 Changes: https://git.openjdk.org/jdk/pull/12726/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12726&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8301991 Stats: 29093 lines in 490 files changed: 6 ins; 0 del; 29087 mod Patch: https://git.openjdk.org/jdk/pull/12726.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12726/head:pull/12726 PR: https://git.openjdk.org/jdk/pull/12726 From jjg at openjdk.org Wed Mar 15 16:08:06 2023 From: jjg at openjdk.org (Jonathan Gibbons) Date: Wed, 15 Mar 2023 16:08:06 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native In-Reply-To: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: On Thu, 23 Feb 2023 09:04:23 GMT, Justin Lu wrote: > This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. 
> > In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. make/langtools/tools/compileproperties/CompileProperties.java line 252: > 250: try { > 251: writer = new BufferedWriter( > 252: new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.ISO_8859_1)); Using ISO_8859_1 seems strange. Since these are generated files, you could write them as UTF-8 and then override the default javac option for ascii when compiling _just_ these files. Or else just stay with ascii; no one should be looking at these files! ------------- PR: https://git.openjdk.org/jdk/pull/12726 From jlu at openjdk.org Wed Mar 15 16:08:07 2023 From: jlu at openjdk.org (Justin Lu) Date: Wed, 15 Mar 2023 16:08:07 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: <_dP9N3UNWa82tfLVEapoSFJjbvMmlyP21ZbuL0NjTDU=.3685af0b-31a0-42aa-86b0-5098bda72766@github.com> On Tue, 7 Mar 2023 23:15:14 GMT, Jonathan Gibbons wrote: >> This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. >> >> In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. > > make/langtools/tools/compileproperties/CompileProperties.java line 252: > >> 250: try { >> 251: writer = new BufferedWriter( >> 252: new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.ISO_8859_1)); > > Using ISO_8859_1 seems strange. > Since these are generated files, you could write them as UTF-8 and then override the default javac option for ascii when compiling _just_ these files. > > Or else just stay with ascii; no one should be looking at these files! Will stick with your latter solution, as since the .properties files were converted via native2ascii, it makes sense to write out via ascii. ------------- PR: https://git.openjdk.org/jdk/pull/12726 From duke at openjdk.org Wed Mar 15 16:21:33 2023 From: duke at openjdk.org (Archie L. Cobbs) Date: Wed, 15 Mar 2023 16:21:33 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native In-Reply-To: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: <1I9v8d2OiyLfQVCozGYVRhAi3AotqGuRUhsNj0VCsUk=.e673ca33-d24f-4aab-908e-a5c0bfa3bf7c@github.com> On Thu, 23 Feb 2023 09:04:23 GMT, Justin Lu wrote: > This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. > > In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. test/jdk/java/util/ResourceBundle/Bug6204853.properties line 1: > 1: # This file should probably be excluded because it's used in a test that relates to UTF-8 encoding (or not) of property files. 
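On the ascii route discussed a few comments up, the conversion itself is just native2ascii-style escaping: read the .properties source as UTF-8 and emit any non-ASCII character as a \uXXXX sequence so the generated file stays pure ASCII. A minimal illustrative sketch in Java, not the actual CompileProperties code; the class and method names are made up for the example:

    final class Native2AsciiSketch {
        // Escapes every character outside the ASCII range as \uXXXX so that a
        // string read from a UTF-8 .properties file can be written to an
        // ASCII-only generated .java file.
        static String escapeToAscii(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c < 0x80) {
                    sb.append(c);
                } else {
                    sb.append(String.format("\\u%04X", (int) c));
                }
            }
            return sb.toString();
        }
    }

For example, escaping the currency sign U+00A4 produces the six characters \u00A4, which javac reads back as the original character regardless of the -encoding setting.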
------------- PR: https://git.openjdk.org/jdk/pull/12726 From vlivanov at openjdk.org Wed Mar 15 16:54:33 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 15 Mar 2023 16:54:33 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v3] In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 14:14:55 GMT, Jorn Vernee wrote: >> The issue is that the size of the code buffer is not large enough to hold the whole stub. >> >> Proposed solution is to scale the size of the stub with the number of arguments. I've adjusted sizes for both downcall and upcall stubs. I've also dropped the number of relocations, since we're not really using any for downcalls, and for upcalls we only have 1 AFAICS. (the size of the relocations can not be zero however, as that leads to the relocation section [not being initialized][1], and triggering [an assert][2] later when the code blob is copied). >> >> The way I've determined the new base size and per-argument size for stubs, is by first linking a stub without any arguments to get the required base size, and by then adding 20 `double` arguments to get a rough per-argument size. Both values have wiggle room as well. The sizes can be printed using e.g. `-XX:+LogCompilation`, and then looking for `nep_invoker_blob` and `upcall_stub*` in the log file. This experiment was done on a fastdebug build to account for additional debug code being generated. The included test is designed to try and maximize the size of the generated stub. >> >> I've also updated `CodeBuffer::log_section_sizes` to print the in-use size, rather than just the capacity and free space. >> >> [1]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L119-L121 >> [2]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L675 > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > RISCV changes Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/12908 From cslucas at openjdk.org Wed Mar 15 17:20:47 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 15 Mar 2023 17:20:47 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v2] In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: <8d3LAUIIFVAJIXyrV2YafqAtAe6yiSPUS5THd2VynTk=.006e4cf8-90fe-43ea-8bb3-bbda4d3244f9@github.com> > Can I please get reviews for this PR to add support for the rematerialization of scalar-replaced objects that participate in allocation merges? > > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. > > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? 
I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges that are used *only* as debug information in SafePointNode and its subclasses. Although there is a performance benefit in doing scalar replacement in this scenario only, the goal of this PR is mainly to add infrastructure to support the rematerialization of SR objects participating in merges. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP, Load+AddP, primarily) subsequently. > > The approach I used is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in allocation merges used only as debug information. > > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression that might be related. I also tested with several applications and didn't see any failure. Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - Merge master - Add support for rematerializing scalar replaced objects participating in allocation merges ------------- Changes: https://git.openjdk.org/jdk/pull/12897/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=01 Stats: 1803 lines in 18 files changed: 1653 ins; 9 del; 141 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From cslucas at openjdk.org Wed Mar 15 18:06:58 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 15 Mar 2023 18:06:58 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v3] In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: > Can I please get reviews for this PR to add support for the rematerialization of scalar-replaced objects that participate in allocation merges? > > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. > > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges that are used *only* as debug information in SafePointNode and its subclasses. 
Although there is a performance benefit in doing scalar replacement in this scenario only, the goal of this PR is mainly to add infrastructure to support the rematerialization of SR objects participating in merges. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP, Load+AddP, primarily) subsequently. > > The approach I used is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in allocation merges used only as debug information. > > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression that might be related. I also tested with several applications and didn't see any failure. Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Fix some typos and do some small refactorings. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12897/files - new: https://git.openjdk.org/jdk/pull/12897/files/ea67a304..3b492d2e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=01-02 Stats: 72 lines in 8 files changed: 1 ins; 7 del; 64 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From duke at openjdk.org Wed Mar 15 18:31:17 2023 From: duke at openjdk.org (Jasmine K.) Date: Wed, 15 Mar 2023 18:31:17 GMT Subject: RFR: 8304230: LShift ideal transform assertion Message-ID: Hi, This PR aims to address the assertion on arm32 where the special-case `add1->in(2) == in(2)` check fails, and it falls through to the regular cases. I'm not quite sure how this issue can manifest as AFAIK the GVN should allow the usage of `==` to check against constants that are equal. I've changed the check from node equality to constant equality to hopefully resolve this. I unfortunately cannot reproduce the behavior on x86, nor do I have access to arm32 hardware, so I would greatly appreciate reviews and help testing this change (cc @bulasevich). Thank you all in advance. ------------- Commit messages: - Check for special case with constant instead of node Changes: https://git.openjdk.org/jdk/pull/13049/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13049&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304230 Stats: 14 lines in 1 file changed: 6 ins; 6 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13049.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13049/head:pull/13049 PR: https://git.openjdk.org/jdk/pull/13049 From duke at openjdk.org Wed Mar 15 18:33:19 2023 From: duke at openjdk.org (Jasmine K.) Date: Wed, 15 Mar 2023 18:33:19 GMT Subject: RFR: 8304230: LShift ideal transform assertion In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 18:24:23 GMT, Jasmine K. wrote: > Hi, > This PR aims to address the assertion on arm32 where the special-case `add1->in(2) == in(2)` check fails, and it falls through to the regular cases. I'm not quite sure how this issue can manifest as AFAIK the GVN should allow the usage of `==` to check against constants that are equal. 
I've changed the check from node equality to constant equality to hopefully resolve this. I unfortunately cannot reproduce the behavior on x86, nor do I have access to arm32 hardware, so I would greatly appreciate reviews and help testing this change (cc @bulasevich). Thank you all in advance. It seems the GHA failure on macOS is unrelated, seems it's failing to get a boot JDK with a 503 ------------- PR: https://git.openjdk.org/jdk/pull/13049 From naoto at openjdk.org Wed Mar 15 20:23:23 2023 From: naoto at openjdk.org (Naoto Sato) Date: Wed, 15 Mar 2023 20:23:23 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native In-Reply-To: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: On Thu, 23 Feb 2023 09:04:23 GMT, Justin Lu wrote: > This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. > > In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. test/jdk/java/text/Format/NumberFormat/CurrencySymbols.properties line 156: > 154: zh=\u00A4 > 155: zh_CN=\uFFE5 > 156: zh_HK=HK$ Why are they not encoded into UTF-8 native? ------------- PR: https://git.openjdk.org/jdk/pull/12726 From jvernee at openjdk.org Wed Mar 15 20:53:28 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Wed, 15 Mar 2023 20:53:28 GMT Subject: RFR: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle [v3] In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 14:14:55 GMT, Jorn Vernee wrote: >> The issue is that the size of the code buffer is not large enough to hold the whole stub. >> >> Proposed solution is to scale the size of the stub with the number of arguments. I've adjusted sizes for both downcall and upcall stubs. I've also dropped the number of relocations, since we're not really using any for downcalls, and for upcalls we only have 1 AFAICS. (the size of the relocations can not be zero however, as that leads to the relocation section [not being initialized][1], and triggering [an assert][2] later when the code blob is copied). >> >> The way I've determined the new base size and per-argument size for stubs, is by first linking a stub without any arguments to get the required base size, and by then adding 20 `double` arguments to get a rough per-argument size. Both values have wiggle room as well. The sizes can be printed using e.g. `-XX:+LogCompilation`, and then looking for `nep_invoker_blob` and `upcall_stub*` in the log file. This experiment was done on a fastdebug build to account for additional debug code being generated. The included test is designed to try and maximize the size of the generated stub. >> >> I've also updated `CodeBuffer::log_section_sizes` to print the in-use size, rather than just the capacity and free space. >> >> [1]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L119-L121 >> [2]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L675 > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > RISCV changes Thanks for the reviews. 
I'm running one more round of tests before integrating. ------------- PR: https://git.openjdk.org/jdk/pull/12908 From kostasto at proton.me Wed Mar 15 21:40:35 2023 From: kostasto at proton.me (Kosta Stojiljkovic) Date: Wed, 15 Mar 2023 21:40:35 +0000 Subject: jtreg test test/jdk/java/lang/StackWalker/StackWalkTest.java fails after jtreg commit 7903373 Message-ID: Dear all, The test in ..test/jdk/java/lang/StackWalker/StackWalkTest.java fails with the latest jtreg build, with the following error: ... recursion chain ... ... ... at StackWalkTest$Test.call(StackWalkTest.java:223) at StackWalkTest.runTest(StackWalkTest.java:270) at StackWalkTest.main(StackWalkTest.java:325) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) at java.base/java.lang.reflect.Method.invoke(Method.java:578) at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138) at java.base/java.lang.Thread.run(Thread.java:1623) Caused by: java.lang.IndexOutOfBoundsException: Index: 1004, Size: 1004 at java.base/java.util.LinkedList.checkElementIndex(LinkedList.java:559) at java.base/java.util.LinkedList.get(LinkedList.java:480) at StackRecorderUtil.compareFrame(StackRecorderUtil.java:64) at StackWalkTest.consume(StackWalkTest.java:145) ... 1018 more JavaTest Message: Test threw exception: java.lang.RuntimeException: extra non-infra stack frame at count 1004: --------------------------- In essence, the test detects an extra non-infra stack frame for the MainWrapper$MainTask's frame. The test should disregard MainTask's stack frame, since it's coming from an infrastructure class - com.sun.javatest.regtest.agent.MainWrapper. The code correctly checks if the stack frame belongs to the mentioned infrastructure class, but it also looks for the inner class - com.sun.javatest.regtest.agent.MainWrapper$MainThread. I believe the problem comes from the following commit (7903373) to the jtreg repository: https://github.com/openjdk/jtreg/commit/5b9e661eb6ee9dd9a9d2690986bbf9ce303a8f03 This commit changed the name of the class MainThread to MainTask, thus making the hardcoded check in the StackWalkTest fail to recognize this extra stack frame as an infra frame. Could you please try to reproduce and let me know if I am missing or misunderstanding something. Best, Kosta Stojiljkovic From jvernee at openjdk.org Wed Mar 15 23:46:31 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Wed, 15 Mar 2023 23:46:31 GMT Subject: Integrated: 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle In-Reply-To: References: Message-ID: <31Rotf1DBSLQIaqV1yNIo0DmVi8aB3KXojPEGvrrj1w=.ed94ce07-7664-4efa-93e0-acb83dbb95d0@github.com> On Tue, 7 Mar 2023 18:02:41 GMT, Jorn Vernee wrote: > The issue is that the size of the code buffer is not large enough to hold the whole stub. > > Proposed solution is to scale the size of the stub with the number of arguments. I've adjusted sizes for both downcall and upcall stubs. I've also dropped the number of relocations, since we're not really using any for downcalls, and for upcalls we only have 1 AFAICS. (the size of the relocations can not be zero however, as that leads to the relocation section [not being initialized][1], and triggering [an assert][2] later when the code blob is copied). 
> > The way I've determined the new base size and per-argument size for stubs, is by first linking a stub without any arguments to get the required base size, and by then adding 20 `double` arguments to get a rough per-argument size. Both values have wiggle room as well. The sizes can be printed using e.g. `-XX:+LogCompilation`, and then looking for `nep_invoker_blob` and `upcall_stub*` in the log file. This experiment was done on a fastdebug build to account for additional debug code being generated. The included test is designed to try and maximize the size of the generated stub. > > I've also updated `CodeBuffer::log_section_sizes` to print the in-use size, rather than just the capacity and free space. > > [1]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L119-L121 > [2]: https://github.com/openjdk/jdk/blob/56512cfe1f0682c98ba3488af3d03ccef632c016/src/hotspot/share/asm/codeBuffer.cpp#L675 This pull request has now been integrated. Changeset: 2b81faeb Author: Jorn Vernee URL: https://git.openjdk.org/jdk/commit/2b81faeb3514060e6c8c950ef4e39e299c43199d Stats: 102 lines in 8 files changed: 81 ins; 0 del; 21 mod 8303022: "assert(allocates2(pc)) failed: not in CodeBuffer memory" When linking downcall handle Reviewed-by: kvn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/12908 From fyang at openjdk.org Thu Mar 16 00:29:43 2023 From: fyang at openjdk.org (Fei Yang) Date: Thu, 16 Mar 2023 00:29:43 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 13:22:20 GMT, Damon Fenacci wrote: >> It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. >> There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). >> >> This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache: >> * at `Compilation::emit_code_epilog` >> * when calling `CodeCache::commit` as flushing is done anyway when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to`. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation. Additionally this halves the number of flushes during adapters generation (lots of calls). >> * at `SharedRuntime::generate_i2c2i_adapters` as this is called with a temporary buffer and an ICache flush is not needed >> >> This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3569 to 2028 on C1 (43.2% improvement) and from 3572 to 1952 on C2 (45.4% improvement). >> >> This fix includes changes for x86_32/64 and aarch64, which I could test thoroughly but also for **arm** and **riscv**, for which I would need some help with testing. > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters Hi, I performed tier1-3 tests on linux-riscv64 boards, result looks good. 
------------- PR: https://git.openjdk.org/jdk/pull/12877 From eliu at openjdk.org Thu Mar 16 01:39:19 2023 From: eliu at openjdk.org (Eric Liu) Date: Thu, 16 Mar 2023 01:39:19 GMT Subject: RFR: 8302906: AArch64: Add SVE backend support for vector unsigned comparison [v5] In-Reply-To: <0N4QegqcD5I5funLRjC7WtivwQkmWgAlbI38Wcy2k8I=.c8fb8cff-4dcd-4f8c-a582-5ae1903773c8@github.com> References: <0N4QegqcD5I5funLRjC7WtivwQkmWgAlbI38Wcy2k8I=.c8fb8cff-4dcd-4f8c-a582-5ae1903773c8@github.com> Message-ID: On Wed, 15 Mar 2023 03:48:01 GMT, changpeng1997 wrote: >> This patch implements unsigned vector comparison on SVE. >> >> 1: Test: >> All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. >> >> 2: Performance: >> (1): Benchmark: >> As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. for ByteVector unsigned comparison: >> >> >> @Benchmark >> public void byteVectorUnsignedCompare() { >> for (int j = 0; j < 200; j++) { >> for (int i = 0; i < bspecies.length(); i++) { >> ByteVector av = ByteVector.fromArray(bspecies, ba, i); >> ByteVector ca = ByteVector.fromArray(bspecies, bb, i); >> av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); >> } >> } >> } >> >> >> (2): Performance data >> >> Before: >> >> >> Benchmark Score(op/ms) Error >> ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 >> ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 >> IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 >> LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 >> >> >> After: >> >> >> Benchmark Score(op/ms) Error >> ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 >> ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 >> IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 >> LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 >> >> >> [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector >> [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi >> [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 >> [4] https://bugs.openjdk.org/browse/JDK-8282850 >> [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae > > changpeng1997 has updated the pull request incrementally with one additional commit since the last revision: > > Move cm() and fcm() to advsimd-three-same section. LGTM ------------- Marked as reviewed by eliu (Committer). PR: https://git.openjdk.org/jdk/pull/12725 From duke at openjdk.org Thu Mar 16 04:19:28 2023 From: duke at openjdk.org (changpeng1997) Date: Thu, 16 Mar 2023 04:19:28 GMT Subject: Integrated: 8302906: AArch64: Add SVE backend support for vector unsigned comparison In-Reply-To: References: Message-ID: <3fhnibjkaWafm02TL_6RmevzQ00tsHEH0jOwuCEsIPY=.520032e5-7ba8-45ea-8a3d-e5281ac5bfab@github.com> On Thu, 23 Feb 2023 07:05:43 GMT, changpeng1997 wrote: > This patch implements unsigned vector comparison on SVE. > > 1: Test: > All vector API test cases[1][2] passed without new failure. Existing test cases can cover all unsigned comparison conditions for all kinds of vector. > > 2: Performance: > (1): Benchmark: > As existing benchmarks in panama repo (such as [3]) have some issues [4] (We will fix them in a separate patch.), I collected performance data with a reduced jmh benchmark [5]. e.g. 
for ByteVector unsigned comparison: > > > @Benchmark > public void byteVectorUnsignedCompare() { > for (int j = 0; j < 200; j++) { > for (int i = 0; i < bspecies.length(); i++) { > ByteVector av = ByteVector.fromArray(bspecies, ba, i); > ByteVector ca = ByteVector.fromArray(bspecies, bb, i); > av.compare(VectorOperators.UNSIGNED_GT, ca).intoArray(br, i); > } > } > } > > > (2): Performance data > > Before: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 4.846 3.419 > ShortVector.UNSIGNED_GE#size(1024) 3.055 1.369 > IntVector.UNSIGNED_LT#size(1024) 3.475 1.269 > LongVector.UNSIGNED_LE#size(1024) 4.515 1.812 > > > After: > > > Benchmark Score(op/ms) Error > ByteVector.UNSIGNED_GT#size(1024) 493.937 1.389 > ShortVector.UNSIGNED_GE#size(1024) 5308.796 20.557 > IntVector.UNSIGNED_LT#size(1024) 4944.744 10.606 > LongVector.UNSIGNED_LE#size(1024) 8459.605 28.683 > > > [1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector > [2] https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/vectorapi > [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L1459 > [4] https://bugs.openjdk.org/browse/JDK-8282850 > [5] https://gist.github.com/changpeng1997/d311127e1015c107197f9b56a92b0fae This pull request has now been integrated. Changeset: 42dd9077 Author: changpeng1997 Committer: Eric Liu URL: https://git.openjdk.org/jdk/commit/42dd9077a087e1431b76c5653db820e65a6cc177 Stats: 262 lines in 9 files changed: 92 ins; 60 del; 110 mod 8302906: AArch64: Add SVE backend support for vector unsigned comparison Reviewed-by: aph, eliu ------------- PR: https://git.openjdk.org/jdk/pull/12725 From tobias.hartmann at oracle.com Thu Mar 16 06:24:55 2023 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 16 Mar 2023 07:24:55 +0100 Subject: jtreg test test/jdk/java/lang/StackWalker/StackWalkTest.java fails after jtreg commit 7903373 In-Reply-To: References: Message-ID: FTR (hotspot-compiler-dev did not like the BCC): https://mail.openjdk.org/pipermail/core-libs-dev/2023-March/102247.html Best regards, Tobias On 15.03.23 22:40, Kosta Stojiljkovic wrote: > Dear all, > > The test in ..test/jdk/java/lang/StackWalker/StackWalkTest.java fails with the latest jtreg build, with the following error: > > ... > recursion chain > ... > ... > ... > at StackWalkTest$Test.call(StackWalkTest.java:223) > at StackWalkTest.runTest(StackWalkTest.java:270) > at StackWalkTest.main(StackWalkTest.java:325) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) > at java.base/java.lang.reflect.Method.invoke(Method.java:578) > at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138) > at java.base/java.lang.Thread.run(Thread.java:1623) > Caused by: java.lang.IndexOutOfBoundsException: Index: 1004, Size: 1004 > at java.base/java.util.LinkedList.checkElementIndex(LinkedList.java:559) > at java.base/java.util.LinkedList.get(LinkedList.java:480) > at StackRecorderUtil.compareFrame(StackRecorderUtil.java:64) > at StackWalkTest.consume(StackWalkTest.java:145) > ... 1018 more > > JavaTest Message: Test threw exception: java.lang.RuntimeException: extra non-infra stack frame at count 1004: > > --------------------------- > > In essence, the test detects an extra non-infra stack frame for the MainWrapper$MainTask's frame. 
The test should disregard MainTask's stack frame, since it's coming from an infrastructure class - com.sun.javatest.regtest.agent.MainWrapper. The code correctly checks if the stack frame belongs to the mentioned infrastructure class, but it also looks for the inner class - com.sun.javatest.regtest.agent.MainWrapper$MainThread. > > I believe the problem comes from the following commit (7903373) to the jtreg repository: https://github.com/openjdk/jtreg/commit/5b9e661eb6ee9dd9a9d2690986bbf9ce303a8f03 > > This commit changed the name of the class MainThread to MainTask, thus making the hardcoded check in the StackWalkTest fail to recognize this extra stack frame as an infra frame. > > Could you please try to reproduce and let me know if I am missing or misunderstanding something. > > Best, > Kosta Stojiljkovic From duke at openjdk.org Thu Mar 16 07:17:23 2023 From: duke at openjdk.org (Damon Fenacci) Date: Thu, 16 Mar 2023 07:17:23 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: <4AyCfhBVMHMcjPufBtZnrr6XGgCH4VBAiZY75u-FHJ4=.8cc7fe75-2442-465b-9142-6d1f7e959a91@github.com> On Thu, 16 Mar 2023 00:26:30 GMT, Fei Yang wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters > > Hi, I performed tier1-3 tests on linux-riscv64 boards, result looks good. @RealFYang thank you very much for running the tests! ------------- PR: https://git.openjdk.org/jdk/pull/12877 From bulasevich at openjdk.org Thu Mar 16 08:27:25 2023 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 16 Mar 2023 08:27:25 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 13:22:20 GMT, Damon Fenacci wrote: >> It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. >> There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). >> >> This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache: >> * at `Compilation::emit_code_epilog` >> * when calling `CodeCache::commit` as flushing is done anyway when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to`. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation. Additionally this halves the number of flushes during adapters generation (lots of calls). >> * at `SharedRuntime::generate_i2c2i_adapters` as this is called with a temporary buffer and an ICache flush is not needed >> >> This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3569 to 2028 on C1 (43.2% improvement) and from 3572 to 1952 on C2 (45.4% improvement). >> >> This fix includes changes for x86_32/64 and aarch64, which I could test thoroughly but also for **arm** and **riscv**, for which I would need some help with testing. 
> > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters HI, tier1-2 tests on ARM32 are OK! ------------- PR: https://git.openjdk.org/jdk/pull/12877 From duke at openjdk.org Thu Mar 16 08:27:26 2023 From: duke at openjdk.org (Damon Fenacci) Date: Thu, 16 Mar 2023 08:27:26 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: On Thu, 16 Mar 2023 08:21:33 GMT, Boris Ulasevich wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters > > HI, tier1-2 tests on ARM32 are OK! @bulasevich thanks a lot for testing! ------------- PR: https://git.openjdk.org/jdk/pull/12877 From duke at openjdk.org Thu Mar 16 08:27:27 2023 From: duke at openjdk.org (Damon Fenacci) Date: Thu, 16 Mar 2023 08:27:27 GMT Subject: RFR: JDK-8303154: Investigate and improve instruction cache flushing during compilation [v3] In-Reply-To: References: Message-ID: <2DDqUam3456YuiW0Wc2ag3dnEX7ibPz6gtcZTcCQX0A=.b6e6312a-85a6-466b-8b9d-f13025391b77@github.com> On Wed, 15 Mar 2023 06:00:02 GMT, Tobias Hartmann wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8303154: remove flush in SharedRuntime::generate_i2c2i_adapters > > FTR, the follow-up RFE is [JDK-8303971](https://bugs.openjdk.org/browse/JDK-8303971). @TobiHartmann @vnkozlov thanks a lot for your reviews! ------------- PR: https://git.openjdk.org/jdk/pull/12877 From duke at openjdk.org Thu Mar 16 08:32:35 2023 From: duke at openjdk.org (Damon Fenacci) Date: Thu, 16 Mar 2023 08:32:35 GMT Subject: Integrated: JDK-8303154: Investigate and improve instruction cache flushing during compilation In-Reply-To: References: Message-ID: <5C6pHyrbCJTR3AM2WnSGNNUESA8o2lLaWUVDdAWx3I8=.f83a4b4c-ddde-4759-8c2a-3c6dd8063b27@github.com> On Mon, 6 Mar 2023 08:37:50 GMT, Damon Fenacci wrote: > It was noticed that we flush the instruction cache too much for a single C1 compilation. The same is true for the C2 compilation. > There are several places in the code where the instruction cache is called and many of them are very intertwined (see [bug report](https://bugs.openjdk.org/browse/JDK-8303154)). > > This PR is meant to be a "minimum" set of changes that improve the situation without introducing excessive extra information to keep track of the origin of the call through call stacks. This is done by avoiding calls to flush the ICache: > * at `Compilation::emit_code_epilog` > * when calling `CodeCache::commit` as flushing is done anyway when copying from the temporary buffer into the code cache in `CodeBuffer::copy_code_to`. This results in flushing the ICache only once instead of 3 times for a C1 compilation and twice for a C2 compilation. Additionally this halves the number of flushes during adapters generation (lots of calls). > * at `SharedRuntime::generate_i2c2i_adapters` as this is called with a temporary buffer and an ICache flush is not needed > > This change decreases the number of calls to flush the ICache for a simple _Hello world_ program on Mac OSX aarch64 from 3569 to 2028 on C1 (43.2% improvement) and from 3572 to 1952 on C2 (45.4% improvement). 
> > This fix includes changes for x86_32/64 and aarch64, which I could test thoroughly but also for **arm** and **riscv**, for which I would need some help with testing. This pull request has now been integrated. Changeset: b7945bc9 Author: Damon Fenacci Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/b7945bc9e5db5761f17a9e56246424fbcab21627 Stats: 11 lines in 7 files changed: 0 ins; 11 del; 0 mod 8303154: Investigate and improve instruction cache flushing during compilation Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/12877 From duke at openjdk.org Thu Mar 16 08:41:32 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Thu, 16 Mar 2023 08:41:32 GMT Subject: Integrated: 8293324: ciField.hpp has two methods to return field's offset In-Reply-To: References: Message-ID: On Mon, 13 Mar 2023 16:52:30 GMT, Ilya Korennoy wrote: > Small refactoring of ciField.hpp method `offset()` removed and `offset_in_bytes()` used instead. > > Test: tier1 linux-x86_64 This pull request has now been integrated. Changeset: 7277bb19 Author: Ilya Korennoy Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/7277bb19f128b84094400cb4262b2e0432e559c5 Stats: 22 lines in 10 files changed: 0 ins; 5 del; 17 mod 8293324: ciField.hpp has two methods to return field's offset Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13003 From thartmann at openjdk.org Thu Mar 16 08:56:29 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 16 Mar 2023 08:56:29 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops [v2] In-Reply-To: References: Message-ID: <0xHmGI1abpLB-_9DvlR9M8G43XZaAx8Y_LeSuRTLFxE=.6aa6a8be-ed9f-4666-9c25-af07f76e180c@github.com> On Tue, 14 Mar 2023 10:58:11 GMT, Roland Westrelin wrote: >> In the test case `testByteLong1` (that's extracted from a memory >> segment micro benchmark), the address of the store is initially: >> >> >> (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) >> >> >> (#numbers are node numbers to help the discussion). >> >> `iv#101` is the `Phi` of a counted loop. `invar#163` is the >> `baseOffset` load. >> >> To eliminate the range check, the loop is transformed into a loop nest >> and as a consequence the address above becomes: >> >> >> (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) >> >> >> `invar#308` is some expression from a `Phi` of the outer loop. >> >> That `AddP` is transformed multiple times to push the invariants out of loop: >> >> >> (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) >> >> >> then: >> >> >> (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) >> >> >> and finally: >> >> >> (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) >> >> >> `AddP#855` is out of the inner loop. 
>> >> This doesn't vectorize because: >> >> - there are 2 invariants in the address expression but superword only >> support one (tracked by `_invar` in `SWPointer`) >> >> - there are more levels of `AddP` (4) than superword supports (3) >> >> To fix that, I propose to no longer track the address elements in >> `_invar`, `_negate_invar` and `_invar_scale` but instead to have a >> single `_invar` which is an expression built by superword as it >> follows chains of `addP` nodes. I kept the previous `_invar`, >> `_negate_invar` and `_invar_scale` as debugging and use them to check >> that what vectorized with the previous scheme still does. >> >> I also propose lifting the restriction on 3 levels of `AddP` entirely. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - NULL -> nullptr > - Merge branch 'master' into JDK-8300257 > - fix & test Looks good to me. I'll run some perf testing and report back. src/hotspot/share/opto/superword.cpp line 4318: > 4316: if (opc == Op_AddI) { > 4317: if (n->in(2)->is_Con() && invariant(n->in(1))) { > 4318: maybe_add_to_invar(maybe_negate_invar(negate, n->in(1))); It feels like `maybe_negate_invar` should be moved into `maybe_add_to_invar` and be controlled by a `negate` argument. src/hotspot/share/opto/superword.hpp line 695: > 693: _debug_invar_scale == q._debug_invar_scale && > 694: _debug_negate_invar == q._debug_negate_invar), ""); > 695: return _invar == q._invar; Suggestion: return _invar == q._invar; test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java line 78: > 76: long start4, long stop4, > 77: long start5, long stop5 > 78: ) { Suggestion: public static void testLoopNest1(byte[] dest, byte[] src, long start1, long stop1, long start2, long stop2, long start3, long stop3, long start4, long stop4, long start5, long stop5) { test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java line 110: > 108: long start4, long stop4, > 109: long start5, long stop5 > 110: ) { Suggestion: public static void testLoopNest2(int[] dest, int[] src, long start1, long stop1, long start2, long stop2, long start3, long stop3, long start4, long stop4, long start5, long stop5) { ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/12942 From tobias.hartmann at oracle.com Thu Mar 16 06:20:33 2023 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Thu, 16 Mar 2023 07:20:33 +0100 Subject: jtreg test test/jdk/java/lang/StackWalker/StackWalkTest.java fails after jtreg commit 7903373 In-Reply-To: References: Message-ID: <4847e074-5a12-7a33-a9fb-d7e412943078@oracle.com> Hi Kosta, Thanks again for the report! This test is owned by core-libs/java.lang, I'm forwarding to core-libs-dev and CC'ing Leonid, the author of https://bugs.openjdk.org/browse/CODETOOLS-7903373. I can see these failures in our testing as well but no one filed a bug yet. I filed: https://bugs.openjdk.org/browse/JDK-8304314 Best regards, Tobias On 15.03.23 22:40, Kosta Stojiljkovic wrote: > Dear all, > > The test in ..test/jdk/java/lang/StackWalker/StackWalkTest.java fails with the latest jtreg build, with the following error: > > ... > recursion chain > ... > ... > ... 
> at StackWalkTest$Test.call(StackWalkTest.java:223) > at StackWalkTest.runTest(StackWalkTest.java:270) > at StackWalkTest.main(StackWalkTest.java:325) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) > at java.base/java.lang.reflect.Method.invoke(Method.java:578) > at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138) > at java.base/java.lang.Thread.run(Thread.java:1623) > Caused by: java.lang.IndexOutOfBoundsException: Index: 1004, Size: 1004 > at java.base/java.util.LinkedList.checkElementIndex(LinkedList.java:559) > at java.base/java.util.LinkedList.get(LinkedList.java:480) > at StackRecorderUtil.compareFrame(StackRecorderUtil.java:64) > at StackWalkTest.consume(StackWalkTest.java:145) > ... 1018 more > > JavaTest Message: Test threw exception: java.lang.RuntimeException: extra non-infra stack frame at count 1004: > > --------------------------- > > In essence, the test detects an extra non-infra stack frame for the MainWrapper$MainTask's frame. The test should disregard MainTask's stack frame, since it's coming from an infrastructure class - com.sun.javatest.regtest.agent.MainWrapper. The code correctly checks if the stack frame belongs to the mentioned infrastructure class, but it also looks for the inner class - com.sun.javatest.regtest.agent.MainWrapper$MainThread. > > I believe the problem comes from the following commit (7903373) to the jtreg repository: https://github.com/openjdk/jtreg/commit/5b9e661eb6ee9dd9a9d2690986bbf9ce303a8f03 > > This commit changed the name of the class MainThread to MainTask, thus making the hardcoded check in the StackWalkTest fail to recognize this extra stack frame as an infra frame. > > Could you please try to reproduce and let me know if I am missing or misunderstanding something. > > Best, > Kosta Stojiljkovic From yyang at openjdk.org Thu Mar 16 11:44:18 2023 From: yyang at openjdk.org (Yi Yang) Date: Thu, 16 Mar 2023 11:44:18 GMT Subject: RFR: 8304049: C2 can not merge trivial Ifs due to CastII In-Reply-To: References: Message-ID: <6DEmJyXOc7y_sq_vlHoq_JwPcuFne-4I2RfHG_y5-2E=.5225b53e-9453-4ec3-a6e8-ab40e3410663@github.com> On Wed, 15 Mar 2023 10:37:03 GMT, Yi Yang wrote: > Hi can I have a review for this patch? C2 can not apply Split If for the attached trivial case. PhiNode::Ideal removes itself by unique_input but introduces a new CastII > > https://github.com/openjdk/jdk/blob/e3777b0c49abb9cc1925f4044392afadf3adef61/src/hotspot/share/opto/cfgnode.cpp#L1470-L1474 > > https://github.com/openjdk/jdk/blob/e3777b0c49abb9cc1925f4044392afadf3adef61/src/hotspot/share/opto/cfgnode.cpp#L2078-L2079 > > Therefore we have two Cmp, which is not identical for split_if. > > ![image](https://user-images.githubusercontent.com/5010047/225285449-b41dc939-1d3f-45f3-b6d6-a9b9445c2f6a.png) > (Fig1. Phi#41 is removed during ideal, create CastII#58 then) > > ![image](https://user-images.githubusercontent.com/5010047/225285493-30471f1c-97b0-452b-9218-3b5f09f09859.png) > (Fig2. CmpI#42 and CmpI#23 are different comparisons, they are not identical_backtoback_ifs ) > > This patch adds Cmp identity to find existing Cmp node, i.e. 
Cmp#42 is identity to Cmp#23
>
> public static void test5(int a, int b){
>
>     if( b!=0) {
>       int_field = 35;
>     } else {
>       int_field =222;
>     }
>
>     if( b!=0) {
>       int_field = 35;
>     } else {
>       int_field =222;
>     }
> }
>
> Test: tier1, application/ctw/modules

A safer approach is that we don't optimize it when Cast input carries any dependencies (In such case, I still see hundreds of transformation happen during building)

Node* uncast = n->uncast();
if (uncast != n && !n->as_ConstraintCast()->carry_dependency()) {

But this can not solve the above trivial case, because CastII#58 generated in [PhiNode::Ideal](https://github.com/y1yang0/jdk/commit/6961dea52a6aae94d1fb4573de64525f4934352e#diff-8ffaebb53d272c5385ca4d9df21e8bda21133d11ad64d4c4a5ab4c3a3301ce17R1697) carries unnecessarily strong dependency anyway, I wonder if it's possible to make strong dependency only when we have precise pattern described in https://bugs.openjdk.org/browse/JDK-8139771

-------------

PR: https://git.openjdk.org/jdk/pull/13039

From bulasevich at openjdk.org  Thu Mar 16 12:00:19 2023
From: bulasevich at openjdk.org (Boris Ulasevich)
Date: Thu, 16 Mar 2023 12:00:19 GMT
Subject: RFR: 8304230: LShift ideal transform assertion
In-Reply-To: 
References: 
Message-ID: <2_8cX_8W0WN3NoIjT0lbyMmlUUvOsUSziSm1PCQxcSc=.23649705-da68-409d-81d5-ef21ba5cb7ce@github.com>

On Wed, 15 Mar 2023 18:24:23 GMT, Jasmine K. wrote:

> Hi,
> This PR aims to address the assertion on arm32 where the special-case `add1->in(2) == in(2)` check fails, and it falls through to the regular cases. I'm not quite sure how this issue can manifest as AFAIK the GVN should allow the usage of `==` to check against constants that are equal. I've changed the check from node equality to constant equality to hopefully resolve this. I unfortunately cannot reproduce the behavior on x86, nor do I have access to arm32 hardware, so I would greatly appreciate reviews and help testing this change (cc @bulasevich). Thank you all in advance.

Hi, I confirm ARM32 tests pass Ok with this change. And the change looks reasonable to me (I am not a reviewer). Thanks!

-------------

PR: https://git.openjdk.org/jdk/pull/13049

From roland at openjdk.org  Thu Mar 16 12:00:20 2023
From: roland at openjdk.org (Roland Westrelin)
Date: Thu, 16 Mar 2023 12:00:20 GMT
Subject: RFR: 8304049: C2 can not merge trivial Ifs due to CastII
In-Reply-To: <6DEmJyXOc7y_sq_vlHoq_JwPcuFne-4I2RfHG_y5-2E=.5225b53e-9453-4ec3-a6e8-ab40e3410663@github.com>
References: 
Message-ID: 

On Thu, 16 Mar 2023 11:41:44 GMT, Yi Yang wrote:

> But this can not solve the above trivial case, because CastII#58 generated in [PhiNode::Ideal](https://github.com/y1yang0/jdk/commit/6961dea52a6aae94d1fb4573de64525f4934352e#diff-8ffaebb53d272c5385ca4d9df21e8bda21133d11ad64d4c4a5ab4c3a3301ce17R1697) carries unnecessarily strong dependency anyway, I wonder if it's possible to make strong dependency only when we have precise pattern described in https://bugs.openjdk.org/browse/JDK-8139771

Nothing says the pattern of that test case is the only that can cause a problem. It still seems simpler to me to do this in `PhaseIdealLoop::identical_backtoback_ifs()` and I don't understand why you're against it.
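A stand-alone sketch of what looking through casts in `identical_backtoback_ifs()` could amount to (the Node struct below is a toy, not C2's Node class, and a real change would also have to respect the cast's recorded dependency):

#include <cstdio>

// Toy IR: a node is either a plain value, a constraint cast of another
// node, or a comparison of two nodes. Only the names mirror the C2 discussion.
struct Node {
  const Node* cast_of = nullptr;   // non-null for CastII-like nodes
  const Node* in1 = nullptr;       // comparison inputs
  const Node* in2 = nullptr;
  const Node* uncast() const { return cast_of ? cast_of->uncast() : this; }
};

// Two comparisons can be treated as identical for if-merging when their
// inputs are the same once casts are looked through.
static bool identical_comparison(const Node& a, const Node& b) {
  return a.in1->uncast() == b.in1->uncast() &&
         a.in2->uncast() == b.in2->uncast();
}

int main() {
  Node b_value{}, zero{};
  Node cast_of_b{&b_value};                 // plays the role of CastII#58
  Node cmp23{nullptr, &b_value, &zero};     // plays the role of CmpI#23 (b, 0)
  Node cmp42{nullptr, &cast_of_b, &zero};   // plays the role of CmpI#42 (CastII(b), 0)
  std::printf("identical: %s\n", identical_comparison(cmp23, cmp42) ? "yes" : "no");
  return 0;
}

With the cast stripped, the two comparisons in the Fig2 situation above compare equal, which is what would let the back-to-back ifs merge.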
------------- PR: https://git.openjdk.org/jdk/pull/13039 From bulasevich at openjdk.org Thu Mar 16 12:04:21 2023 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 16 Mar 2023 12:04:21 GMT Subject: RFR: 8304230: LShift ideal transform assertion In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 18:30:28 GMT, Jasmine K. wrote: > It seems the GHA failure on macOS is unrelated, seems it's failing to get a boot JDK with a 503 Yes, this failure is unrelated to the change. You can restart the tests. ------------- PR: https://git.openjdk.org/jdk/pull/13049 From yyang at openjdk.org Thu Mar 16 12:06:46 2023 From: yyang at openjdk.org (Yi Yang) Date: Thu, 16 Mar 2023 12:06:46 GMT Subject: RFR: 8304034: Remove redundant and meaningless comments in opto [v6] In-Reply-To: References: Message-ID: <3H6qdeUA8opQBX3D6NAMgpgCZbL9u1Nx5yorXdJ96zc=.e9d0f1a3-3720-4d7a-822d-9d511172e5bd@github.com> > Please help review this trivial change to remove redundant and meaningless comments in `hotspot/share/opto` directory. > > They are either > 1. Repeat the function name that the function they comment for. > 2. Makes no sense, e.g. `//----Idealize----` > > And I think original CC-style code (`if( test )`,`call( arg )`) can be formatted in one go, instead of formatting the near code when someone touches them. But this may form a big patch, and it confuses code blame, so I left this work until we reach a consensus. > > Thanks! Yi Yang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: - review feedback - Merge branch 'master' into cleanupc2 - restore mistakenly removed lines - cleanup more - reserve some comments - multiple empty lines to one empty lines - reserve some comments - 8304034: Remove redundant and meaningless comments in opto ------------- Changes: https://git.openjdk.org/jdk/pull/12995/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=05 Stats: 2752 lines in 117 files changed: 1 ins; 2745 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/12995.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12995/head:pull/12995 PR: https://git.openjdk.org/jdk/pull/12995 From dzhang at openjdk.org Thu Mar 16 13:53:53 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 16 Mar 2023 13:53:53 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v6] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot. > > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` implement the mask-passed datapath. > > We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 21c loadV V1, [R7] # vector (rvv) > 224 vloadmask V30, V1 # KILL cr > 22c vmaskcmp_rvv_masked V30, V4, V5, V30, #0 # KILL cr > 240 > 240 MEMBAR-store-store #@membar_storestore > 244 # checkcastPP of R8, #@checkCastPP > 244 vstoremask V1, V30 > > > The corresponding generated jit assembly? 
> > # loadV > 0x000000400c8fce5c: vsetivli t0,16,e8,m1,tu,mu > 0x000000400c8fce60: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8fce64: vsetivli t0,16,e8,m1,tu,mu > 0x000000400c8fce68: vmsne.vx v30,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8fce6c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8fce70: vmmv.m v0,v30 > 0x000000400c8fce74: vsetivli t0,16,e8,m1,tu,mu > 0x000000400c8fce78: vmclr.m v30 > 0x000000400c8fce7c: vmseq.vv v30,v4,v5,v0.t > > # vstoremask > 0x000000400c8fce84: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8fce88: vmv.v.i v1,0 > 0x000000400c8fce8c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8fce90: vmmv.m v0,v30 > 0x000000400c8fce94: vmerge.vim v1,v1,1,v0 > > > `AndVMask, OrVMask, XorVMask` will be used for operations such as division. > The current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. > > AddMaskTestMerge case: > > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 # KILL cr > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c8109ae: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109b2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c8109b6: vsetivli t0,4,e8,m1,tu,mu > 0x000000400c8109ba: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c8109be: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109c2: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Unmatched: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains one additional commit since the last revision: RISC-V: Support vector add mask instructions for Vector API ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/59a15d59..c2aa9997 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=04-05 Stats: 134181 lines in 1133 files changed: 98159 ins; 21395 del; 14627 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From duke at openjdk.org Thu Mar 16 14:47:19 2023 From: duke at openjdk.org (Jasmine K.) Date: Thu, 16 Mar 2023 14:47:19 GMT Subject: RFR: 8304230: LShift ideal transform assertion In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 18:24:23 GMT, Jasmine K. wrote: > Hi, > This PR aims to address the assertion on arm32 where the special-case `add1->in(2) == in(2)` check fails, and it falls through to the regular cases. I'm not quite sure how this issue can manifest as AFAIK the GVN should allow the usage of `==` to check against constants that are equal. I've changed the check from node equality to constant equality to hopefully resolve this. I unfortunately cannot reproduce the behavior on x86, nor do I have access to arm32 hardware, so I would greatly appreciate reviews and help testing this change (cc @bulasevich). Thank you all in advance. Great, thank you for the testing! 
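The heart of that fix can be sketched in isolation (toy types only; `find_int_con` here just mimics the name of the C2 helper): compare the constant values of the two shift counts rather than the node pointers.

#include <cstdio>
#include <optional>

// Toy stand-in for a node that may carry a known integer constant.
struct ToyNode {
  bool has_con;
  int con;
  std::optional<int> find_int_con() const {
    return has_con ? std::optional<int>(con) : std::nullopt;
  }
};

// Pointer equality (n1 == n2) can miss two distinct nodes holding the
// same constant; comparing the constant values does not.
static bool same_shift_count(const ToyNode* a, const ToyNode* b) {
  auto ca = a->find_int_con();
  auto cb = b->find_int_con();
  return ca && cb && *ca == *cb;
}

int main() {
  ToyNode c1{true, 3}, c2{true, 3};     // two distinct nodes, same constant
  std::printf("pointer equal: %d, value equal: %d\n",
              &c1 == &c2, same_shift_count(&c1, &c2));
  return 0;
}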
------------- PR: https://git.openjdk.org/jdk/pull/13049 From duke at openjdk.org Thu Mar 16 15:17:06 2023 From: duke at openjdk.org (Damon Fenacci) Date: Thu, 16 Mar 2023 15:17:06 GMT Subject: RFR: JDK-8303069: Memory leak in CompilerOracle::parse_from_line Message-ID: A memory leak has been detected using *lsan* when running the `compiler/blackhole/BlackholeExperimentalUnlockTest.java` test. This happens when parsing the *blackhole* *CompileCommand*. There is a check for the `-XX:+UnlockExperimentalVMOptions` flag being enabled when *blackhole* is used. If this flag is not set, a warning gets printed and the *CompileCommand* is not taken. Unfortunately this happens in `register_commands` where commands are supposed to be added to a list. In this case we bail out and the option is neither added nor deleted. https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L310-L313 So, this changes deletes the `matcher` object before returning. This is done in `register_commands` (the callee) mainly because a couple of other similar checks are also done in this function. The other option would have been to move the check to the caller (the only one for this case) before `register_command`: https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L915-L917 but it seemed less appropriate (no other similar checks). Letting it handle like other experimental flags (code below) is not an option either since in this case we have to do with a `CompileCommand`. https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/runtime/flags/jvmFlag.cpp#L112-L118 ------------- Commit messages: - JDK-8303069: Memory leak in CompilerOracle::parse_from_line Changes: https://git.openjdk.org/jdk/pull/13060/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13060&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303069 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13060.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13060/head:pull/13060 PR: https://git.openjdk.org/jdk/pull/13060 From thartmann at openjdk.org Thu Mar 16 15:17:07 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 16 Mar 2023 15:17:07 GMT Subject: RFR: JDK-8303069: Memory leak in CompilerOracle::parse_from_line In-Reply-To: References: Message-ID: On Thu, 16 Mar 2023 14:08:13 GMT, Damon Fenacci wrote: > A memory leak has been detected using *lsan* when running the `compiler/blackhole/BlackholeExperimentalUnlockTest.java` test. > > This happens when parsing the *blackhole* *CompileCommand*. There is a check for the `-XX:+UnlockExperimentalVMOptions` flag being enabled when *blackhole* is used. If this flag is not set, a warning gets printed and the *CompileCommand* is not taken. > > Unfortunately this happens in `register_commands` where commands are supposed to be added to a list. In this case we bail out and the option is neither added nor deleted. > > https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L310-L313 > > So, this changes deletes the `matcher` object before returning. This is done in `register_commands` (the callee) mainly because a couple of other similar checks are also done in this function. 
> > The other option would have been to move the check to the caller (the only one for this case) before `register_command`:
> https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L915-L917
> but it seemed less appropriate (no other similar checks).
> 
> Letting it handle like other experimental flags (code below) is not an option either since in this case we have to do with a `CompileCommand`.
> https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/runtime/flags/jvmFlag.cpp#L112-L118

Looks good to me.

-------------

Marked as reviewed by thartmann (Reviewer).

PR: https://git.openjdk.org/jdk/pull/13060

From duke at openjdk.org  Thu Mar 16 15:53:35 2023
From: duke at openjdk.org (Kosta Stojiljkovic)
Date: Thu, 16 Mar 2023 15:53:35 GMT
Subject: RFR: 8304242: CPUInfoTest fails because "serialize" CPU feature is not known
Message-ID: 

This test fails on modern x86_64 hardware with the "serialize" feature (e.g. Intel Gen 12 and higher).
Support for this feature was added by JDK-8264543 but the test wasn't updated.

I have updated the test to recognize "serialize" as a supported CPU feature.
Tested on 13th Gen Intel(R) Core(TM) i7-13700K by running this new version of the test.

-------------

Commit messages:
 - Modified the copyright notice with current year.
 - Modified CPUInfoTest.java to have serialize in its list of well known cpu features.

Changes: https://git.openjdk.org/jdk/pull/13062/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13062&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8304242
Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
Patch: https://git.openjdk.org/jdk/pull/13062.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/13062/head:pull/13062

PR: https://git.openjdk.org/jdk/pull/13062

From kvn at openjdk.org  Thu Mar 16 15:59:43 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 16 Mar 2023 15:59:43 GMT
Subject: RFR: 8304242: CPUInfoTest fails because "serialize" CPU feature is not known
In-Reply-To: 
References: 
Message-ID: 

On Thu, 16 Mar 2023 15:40:01 GMT, Kosta Stojiljkovic wrote:

> This test fails on modern x86_64 hardware with the "serialize" feature (e.g. Intel Gen 12 and higher).
> Support for this feature was added by JDK-8264543 but the test wasn't updated.
> 
> I have updated the test to recognize "serialize" as a supported CPU feature.
> Tested on 13th Gen Intel(R) Core(TM) i7-13700K by running this new version of the test.

Good and trivial.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/13062

From sviswanathan at openjdk.org  Thu Mar 16 16:57:14 2023
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Thu, 16 Mar 2023 16:57:14 GMT
Subject: RFR: 8304242: CPUInfoTest fails because "serialize" CPU feature is not known
In-Reply-To: 
References: 
Message-ID: 

On Thu, 16 Mar 2023 15:40:01 GMT, Kosta Stojiljkovic wrote:

> This test fails on modern x86_64 hardware with the "serialize" feature (e.g. Intel Gen 12 and higher).
> Support for this feature was added by JDK-8264543 but the test wasn't updated.
> 
> I have updated the test to recognize "serialize" as a supported CPU feature.
> Tested on 13th Gen Intel(R) Core(TM) i7-13700K by running this new version of the test.

Marked as reviewed by sviswanathan (Reviewer).
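The shape of the check that failed here, as a stand-alone sketch (the feature names below are examples only; the real CPUInfoTest is a Java jtreg test with a much longer list): every feature name the VM reports must appear in the test's list of well-known features, so a newly reported name like "serialize" requires a matching test update.

#include <cstdio>
#include <set>
#include <sstream>
#include <string>

int main() {
  // The test's list of well-known feature names, with the new name added.
  std::set<std::string> well_known = {"sse4.1", "sse4.2", "avx", "avx2",
                                      "avx512f", "serialize"};
  // Feature string as the VM might report it on newer hardware.
  std::string vm_features = "sse4.1 sse4.2 avx avx2 serialize";

  std::istringstream in(vm_features);
  std::string f;
  bool ok = true;
  while (in >> f) {
    if (well_known.count(f) == 0) {
      std::printf("unknown feature: %s\n", f.c_str());
      ok = false;
    }
  }
  std::puts(ok ? "all features known" : "test would fail");
  return 0;
}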
------------- PR: https://git.openjdk.org/jdk/pull/13062 From jcking at openjdk.org Thu Mar 16 17:31:35 2023 From: jcking at openjdk.org (Justin King) Date: Thu, 16 Mar 2023 17:31:35 GMT Subject: RFR: JDK-8303069: Memory leak in CompilerOracle::parse_from_line In-Reply-To: References: Message-ID: On Thu, 16 Mar 2023 14:08:13 GMT, Damon Fenacci wrote: > A memory leak has been detected using *lsan* when running the `compiler/blackhole/BlackholeExperimentalUnlockTest.java` test. > > This happens when parsing the *blackhole* *CompileCommand*. There is a check for the `-XX:+UnlockExperimentalVMOptions` flag being enabled when *blackhole* is used. If this flag is not set, a warning gets printed and the *CompileCommand* is not taken. > > Unfortunately this happens in `register_commands` where commands are supposed to be added to a list. In this case we bail out and the option is neither added nor deleted. > > https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L310-L313 > > So, this changes deletes the `matcher` object before returning. This is done in `register_commands` (the callee) mainly because a couple of other similar checks are also done in this function. > > The other option would have been to move the check to the caller (the only one for this case) before `register_command`: > https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L915-L917 > but it seemed less appropriate (no other similar checks). > > Letting it handle like other experimental flags (code below) is not an option either since in this case we have to do with a `CompileCommand`. > https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/runtime/flags/jvmFlag.cpp#L112-L118 Marked as reviewed by jcking (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/13060 From jlu at openjdk.org Thu Mar 16 18:19:29 2023 From: jlu at openjdk.org (Justin Lu) Date: Thu, 16 Mar 2023 18:19:29 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v2] In-Reply-To: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: > This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. > > In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. 
Justin Lu has updated the pull request incrementally with four additional commits since the last revision: - Bug6204853 should not be converted - Copyright year for CompileProperties - Redo translation for CS.properties - Spot convert CurrencySymbols.properties ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12726/files - new: https://git.openjdk.org/jdk/pull/12726/files/1e798f24..6d6bffe8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12726&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12726&range=00-01 Stats: 92 lines in 4 files changed: 0 ins; 0 del; 92 mod Patch: https://git.openjdk.org/jdk/pull/12726.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12726/head:pull/12726 PR: https://git.openjdk.org/jdk/pull/12726 From jlu at openjdk.org Thu Mar 16 18:21:40 2023 From: jlu at openjdk.org (Justin Lu) Date: Thu, 16 Mar 2023 18:21:40 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v2] In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: On Thu, 16 Mar 2023 18:19:29 GMT, Justin Lu wrote: >> This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. >> >> In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. > > Justin Lu has updated the pull request incrementally with four additional commits since the last revision: > > - Bug6204853 should not be converted > - Copyright year for CompileProperties > - Redo translation for CS.properties > - Spot convert CurrencySymbols.properties test/jdk/java/text/Format/NumberFormat/CurrencySymbols.properties line 1: > 1: # Conversion did not work as expected, addressing right now. ------------- PR: https://git.openjdk.org/jdk/pull/12726 From jlu at openjdk.org Thu Mar 16 18:21:43 2023 From: jlu at openjdk.org (Justin Lu) Date: Thu, 16 Mar 2023 18:21:43 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v2] In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: On Wed, 15 Mar 2023 20:19:51 GMT, Naoto Sato wrote: >> Justin Lu has updated the pull request incrementally with four additional commits since the last revision: >> >> - Bug6204853 should not be converted >> - Copyright year for CompileProperties >> - Redo translation for CS.properties >> - Spot convert CurrencySymbols.properties > > test/jdk/java/text/Format/NumberFormat/CurrencySymbols.properties line 156: > >> 154: zh=\u00A4 >> 155: zh_CN=\uFFE5 >> 156: zh_HK=HK$ > > Why are they not encoded into UTF-8 native? Not sure, thank you for catching it. Working on it right now. 
------------- PR: https://git.openjdk.org/jdk/pull/12726 From jlu at openjdk.org Thu Mar 16 18:21:46 2023 From: jlu at openjdk.org (Justin Lu) Date: Thu, 16 Mar 2023 18:21:46 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v2] In-Reply-To: <1I9v8d2OiyLfQVCozGYVRhAi3AotqGuRUhsNj0VCsUk=.e673ca33-d24f-4aab-908e-a5c0bfa3bf7c@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> <1I9v8d2OiyLfQVCozGYVRhAi3AotqGuRUhsNj0VCsUk=.e673ca33-d24f-4aab-908e-a5c0bfa3bf7c@github.com> Message-ID: <_6WBGo5CQBseDEjMv16qCWmodFlYOO4gsT9WbON7ddA=.f94339a4-8893-47e4-8bb1-f28a8807ad9d@github.com> On Wed, 15 Mar 2023 16:18:44 GMT, Archie L. Cobbs wrote: >> Justin Lu has updated the pull request incrementally with four additional commits since the last revision: >> >> - Bug6204853 should not be converted >> - Copyright year for CompileProperties >> - Redo translation for CS.properties >> - Spot convert CurrencySymbols.properties > > test/jdk/java/util/ResourceBundle/Bug6204853.properties line 1: > >> 1: # > > This file should probably be excluded because it's used in a test that relates to UTF-8 encoding (or not) of property files. Thank you, removed the changes for this file ------------- PR: https://git.openjdk.org/jdk/pull/12726 From jlu at openjdk.org Thu Mar 16 18:35:51 2023 From: jlu at openjdk.org (Justin Lu) Date: Thu, 16 Mar 2023 18:35:51 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v3] In-Reply-To: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: > This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. > > In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. Justin Lu has updated the pull request incrementally with two additional commits since the last revision: - Reconvert CS.properties to UTF-8 - Revert all changes to CurrencySymbols.properties ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12726/files - new: https://git.openjdk.org/jdk/pull/12726/files/6d6bffe8..7119830b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12726&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12726&range=01-02 Stats: 87 lines in 1 file changed: 0 ins; 0 del; 87 mod Patch: https://git.openjdk.org/jdk/pull/12726.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12726/head:pull/12726 PR: https://git.openjdk.org/jdk/pull/12726 From jlu at openjdk.org Thu Mar 16 18:35:54 2023 From: jlu at openjdk.org (Justin Lu) Date: Thu, 16 Mar 2023 18:35:54 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v3] In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: On Thu, 16 Mar 2023 18:31:23 GMT, Justin Lu wrote: >> This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. >> >> In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. 
> > Justin Lu has updated the pull request incrementally with two additional commits since the last revision: > > - Reconvert CS.properties to UTF-8 > - Revert all changes to CurrencySymbols.properties test/jdk/java/text/Format/NumberFormat/CurrencySymbols.properties line 1: > 1: # CurrencySymbols.properties is fully converted to UTF-8 now ------------- PR: https://git.openjdk.org/jdk/pull/12726 From yzheng at openjdk.org Thu Mar 16 21:19:53 2023 From: yzheng at openjdk.org (Yudi Zheng) Date: Thu, 16 Mar 2023 21:19:53 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v3] In-Reply-To: References: Message-ID: > Upon uncommon_trap, JVMCI runtime appends a FailedSpeculation entry to the nmethod using an [atomic operation](https://github.com/openjdk/jdk/blob/55aa122462c34d8f4cafa58f4d1f2d900449c83e/src/hotspot/share/oops/methodData.cpp#L852). It becomes a performance bottleneck when there is a large amount of (virtual) threads deoptimizing in the nmethod. In this PR, we test if a FailedSpeculation exists in the list before appending it. Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: avoid duplicated entry. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13022/files - new: https://git.openjdk.org/jdk/pull/13022/files/e8c7eec4..fdf93ae1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13022&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13022&range=01-02 Stats: 38 lines in 1 file changed: 20 ins; 18 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13022.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13022/head:pull/13022 PR: https://git.openjdk.org/jdk/pull/13022 From yzheng at openjdk.org Thu Mar 16 21:19:57 2023 From: yzheng at openjdk.org (Yudi Zheng) Date: Thu, 16 Mar 2023 21:19:57 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v2] In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 16:06:38 GMT, Vladimir Kozlov wrote: >> Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: >> >> avoid iterating from beginning. > > Good. > Please, re-test with latest changes. @vnkozlov could you please review https://github.com/openjdk/jdk/pull/13022/commits/fdf93ae17bad8e4daeaa982d8d43bc49c9550d73 that avoids duplicated FailedSpeculation entries? ------------- PR: https://git.openjdk.org/jdk/pull/13022 From kvn at openjdk.org Fri Mar 17 00:08:26 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 17 Mar 2023 00:08:26 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v3] In-Reply-To: References: Message-ID: On Thu, 16 Mar 2023 21:19:53 GMT, Yudi Zheng wrote: >> Upon uncommon_trap, JVMCI runtime appends a FailedSpeculation entry to the nmethod using an [atomic operation](https://github.com/openjdk/jdk/blob/55aa122462c34d8f4cafa58f4d1f2d900449c83e/src/hotspot/share/oops/methodData.cpp#L852). It becomes a performance bottleneck when there is a large amount of (virtual) threads deoptimizing in the nmethod. In this PR, we test if a FailedSpeculation exists in the list before appending it. > > Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: > > avoid duplicated entry. Yes, it would do what you are saying but I don't like infinite loops like that with conditional exists. How big length of `failed_speculations_address` list you can have? 
Consider implementing it as recursive method if depth is not big:

bool FailedSpeculation::add_failed_speculation(nmethod* nm, FailedSpeculation** failed_speculations_address, address speculation, int speculation_len) {
  assert(failed_speculations_address != nullptr, "must be");
  guarantee_failed_speculations_alive(nm, failed_speculations_address);
  size_t fs_size = sizeof(FailedSpeculation) + speculation_len;
  FailedSpeculation* fs = new (fs_size) FailedSpeculation(speculation, speculation_len);
  if (fs == nullptr) {
    // no memory -> ignore failed speculation
    return false;
  }
  guarantee(is_aligned(fs, sizeof(FailedSpeculation*)), "FailedSpeculation objects must be pointer aligned");
  if (!add_failed_speculation_recursive(failed_speculations_address, fs)) {
    delete fs;
    return false;
  }
  return true;
}

bool add_failed_speculation_recursive(FailedSpeculation** cursor, FailedSpeculation* fs) {
  if (*cursor == nullptr) {
    FailedSpeculation* old_fs = Atomic::cmpxchg(cursor, (FailedSpeculation*) nullptr, fs);
    if (old_fs == nullptr) {
      // Successfully appended fs to end of the list
      return true;
    }
    guarantee(*cursor != nullptr, "cursor must point to non-null FailedSpeculation");
  }
  // check if the current entry matches this thread's failed speculation
  int speculation_len = fs->data_len();
  if ((*cursor)->data_len() == speculation_len && memcmp(fs->data(), (*cursor)->data(), speculation_len) == 0) {
    return false;
  }
  return add_failed_speculation_recursive((*cursor)->next_adr(), fs);
}

-------------

PR: https://git.openjdk.org/jdk/pull/13022

From dzhang at openjdk.org  Fri Mar 17 01:57:49 2023
From: dzhang at openjdk.org (Dingli Zhang)
Date: Fri, 17 Mar 2023 01:57:49 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v7]
In-Reply-To: 
References: 
Message-ID: 

> HI,
> 
> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot.
> 
> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1].
> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` implement the mask-passed datapath.
> 
> We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`?
> 
> 21c loadV V1, [R7] # vector (rvv)
> 224 vloadmask V30, V1 # KILL cr
> 22c vmaskcmp_rvv_masked V30, V4, V5, V30, #0 # KILL cr
> 240
> 240 MEMBAR-store-store #@membar_storestore
> 244 # checkcastPP of R8, #@checkCastPP
> 244 vstoremask V1, V30
> 
> 
> The corresponding generated jit assembly?
> 
> # loadV
> 0x000000400c8fce5c: vsetivli t0,16,e8,m1,tu,mu
> 0x000000400c8fce60: vle8.v v1,(t2)
> 
> # vloadmask
> 0x000000400c8fce64: vsetivli t0,16,e8,m1,tu,mu
> 0x000000400c8fce68: vmsne.vx v30,v1,zero
> 
> # vmaskcmp_rvv_masked
> 0x000000400c8fce6c: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c8fce70: vmmv.m v0,v30
> 0x000000400c8fce74: vsetivli t0,16,e8,m1,tu,mu
> 0x000000400c8fce78: vmclr.m v30
> 0x000000400c8fce7c: vmseq.vv v30,v4,v5,v0.t
> 
> # vstoremask
> 0x000000400c8fce84: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c8fce88: vmv.v.i v1,0
> 0x000000400c8fce8c: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c8fce90: vmmv.m v0,v30
> 0x000000400c8fce94: vmerge.vim v1,v1,1,v0
> 
> 
> `AndVMask, OrVMask, XorVMask` will be used for operations such as division.
> The current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node.
> > AddMaskTestMerge case: > > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 # KILL cr > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c8109ae: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109b2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c8109b6: vsetivli t0,4,e8,m1,tu,mu > 0x000000400c8109ba: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c8109be: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109c2: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Unmatched: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Fix useage of iRegIorL2I and remove nouse cr ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/c2aa9997..60052751 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=05-06 
Stats: 15 lines in 1 file changed: 0 ins; 4 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Fri Mar 17 02:01:54 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 17 Mar 2023 02:01:54 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v8] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot. > > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` implement the mask-passed datapath. > > We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 21c loadV V1, [R7] # vector (rvv) > 224 vloadmask V30, V1 # KILL cr > 22c vmaskcmp_rvv_masked V30, V4, V5, V30, #0 # KILL cr > 240 > 240 MEMBAR-store-store #@membar_storestore > 244 # checkcastPP of R8, #@checkCastPP > 244 vstoremask V1, V30 > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8fce5c: vsetivli t0,16,e8,m1,tu,mu > 0x000000400c8fce60: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8fce64: vsetivli t0,16,e8,m1,tu,mu > 0x000000400c8fce68: vmsne.vx v30,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8fce6c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8fce70: vmmv.m v0,v30 > 0x000000400c8fce74: vsetivli t0,16,e8,m1,tu,mu > 0x000000400c8fce78: vmclr.m v30 > 0x000000400c8fce7c: vmseq.vv v30,v4,v5,v0.t > > # vstoremask > 0x000000400c8fce84: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8fce88: vmv.v.i v1,0 > 0x000000400c8fce8c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8fce90: vmmv.m v0,v30 > 0x000000400c8fce94: vmerge.vim v1,v1,1,v0 > > > `AndVMask, OrVMask, XorVMask` will be used for operations such as division. > The current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. > > AddMaskTestMerge case: > > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 # KILL cr > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c8109ae: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109b2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c8109b6: vsetivli t0,4,e8,m1,tu,mu > 0x000000400c8109ba: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c8109be: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109c2: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Unmatched: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) Dingli Zhang has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: Fix useage of iRegIorL2I and remove nouse cr ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/60052751..2f7b7c06 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=06-07 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Fri Mar 17 03:38:03 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 17 Mar 2023 03:38:03 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v9] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot. > > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` implement the mask-passed datapath. > > We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? 
> > 21c loadV V1, [R7] # vector (rvv) > 224 vloadmask V30, V1 # KILL cr > 22c vmaskcmp_rvv_masked V30, V4, V5, V30, #0 # KILL cr > 240 > 240 MEMBAR-store-store #@membar_storestore > 244 # checkcastPP of R8, #@checkCastPP > 244 vstoremask V1, V30 > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8fce5c: vsetivli t0,16,e8,m1,tu,mu > 0x000000400c8fce60: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8fce64: vsetivli t0,16,e8,m1,tu,mu > 0x000000400c8fce68: vmsne.vx v30,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8fce6c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8fce70: vmmv.m v0,v30 > 0x000000400c8fce74: vsetivli t0,16,e8,m1,tu,mu > 0x000000400c8fce78: vmclr.m v30 > 0x000000400c8fce7c: vmseq.vv v30,v4,v5,v0.t > > # vstoremask > 0x000000400c8fce84: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8fce88: vmv.v.i v1,0 > 0x000000400c8fce8c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8fce90: vmmv.m v0,v30 > 0x000000400c8fce94: vmerge.vim v1,v1,1,v0 > > > `AndVMask, OrVMask, XorVMask` will be used for operations such as division. > The current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. > > AddMaskTestMerge case: > > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 # KILL cr > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c8109ae: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109b2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c8109b6: vsetivli t0,4,e8,m1,tu,mu > 0x000000400c8109ba: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c8109be: vsetivli t0,4,e32,m1,tu,mu > 0x000000400c8109c2: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Unmatched: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Remove some no use code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/2f7b7c06..d971f015 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=07-08 Stats: 4 lines in 1 file changed: 0 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From eliu at openjdk.org Fri Mar 17 06:21:17 2023 From: eliu at openjdk.org (Eric Liu) Date: Fri, 17 Mar 2023 06:21:17 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB Message-ID: This is a trivial patch, which fixes the bottom type of ExtractB/UB nodes. ExtractNode can be generated by Vector API Vector.lane(int), which gets the lane element at the given index. A more precise type of range can help to optimize out unnecessary type conversion in some cases. Below shows a typical case used ExtractBNode public static byte byteLt16() { ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); return vecb.lane(1); } In this case, c2 constructs IR graph like: ExtractB ConI(24) | __| | / | LShiftI __| | / RShiftI which generates AArch64 code: movi v16.16b, #0x1 smov x11, v16.b[1] sxtb w0, w11 with this patch, this shift pair can be optimized out by RShiftI's identity [1]. The code is optimized to: movi v16.16b, #0x1 smov x0, v16.b[1] [TEST] Full jtreg passed except 4 files on x86: jdk/incubator/vector/Byte128VectorTests.java jdk/incubator/vector/Byte256VectorTests.java jdk/incubator/vector/Byte512VectorTests.java jdk/incubator/vector/Byte64VectorTests.java They are caused by a known issue on x86 [2]. 
[1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 [2] https://bugs.openjdk.org/browse/JDK-8303508 ------------- Commit messages: - 8303278: Imprecise bottom type of ExtractB/UB Changes: https://git.openjdk.org/jdk/pull/13070/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13070&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303278 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13070.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13070/head:pull/13070 PR: https://git.openjdk.org/jdk/pull/13070 From thartmann at openjdk.org Fri Mar 17 06:23:31 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 17 Mar 2023 06:23:31 GMT Subject: RFR: 8304242: CPUInfoTest fails because "serialize" CPU feature is not known In-Reply-To: References: Message-ID: On Thu, 16 Mar 2023 15:40:01 GMT, Kosta Stojiljkovic wrote: > This test fails on modern x86_64 hardware with "serialize" feature (eg. Intel Gen 12 and higher). > Support for this feature was added by JDK-8264543 but the test wasn't updated. > > I have updated the test to recognize "serialize" as a supported CPU feature. > Tested on 13th Gen Intel(R) Core(TM) i7-13700K by running this new version of the test. Marked as reviewed by thartmann (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/13062 From duke at openjdk.org Fri Mar 17 06:23:32 2023 From: duke at openjdk.org (Kosta Stojiljkovic) Date: Fri, 17 Mar 2023 06:23:32 GMT Subject: Integrated: 8304242: CPUInfoTest fails because "serialize" CPU feature is not known In-Reply-To: References: Message-ID: On Thu, 16 Mar 2023 15:40:01 GMT, Kosta Stojiljkovic wrote: > This test fails on modern x86_64 hardware with "serialize" feature (eg. Intel Gen 12 and higher). > Support for this feature was added by JDK-8264543 but the test wasn't updated. > > I have updated the test to recognize "serialize" as a supported CPU feature. > Tested on 13th Gen Intel(R) Core(TM) i7-13700K by running this new version of the test. This pull request has now been integrated. Changeset: 36995c5a Author: Kosta Stojiljkovic Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/36995c5a75c74c1748c1751ac621b5d62e964fc5 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8304242: CPUInfoTest fails because "serialize" CPU feature is not known Reviewed-by: kvn, sviswanathan, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13062 From qamai at openjdk.org Fri Mar 17 06:49:21 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 17 Mar 2023 06:49:21 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB In-Reply-To: References: Message-ID: On Fri, 17 Mar 2023 06:14:00 GMT, Eric Liu wrote: > This is a trivial patch, which fixes the bottom type of ExtractB/UB nodes. > > ExtractNode can be generated by Vector API Vector.lane(int), which gets the lane element at the given index. A more precise type of range can help to optimize out unnecessary type conversion in some cases. > > Below shows a typical case used ExtractBNode > > > public static byte byteLt16() { > ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); > return vecb.lane(1); > } > > > In this case, c2 constructs IR graph like: > > ExtractB ConI(24) > | __| > | / | > LShiftI __| > | / > RShiftI > > which generates AArch64 code: > > movi v16.16b, #0x1 > smov x11, v16.b[1] > sxtb w0, w11 > > with this patch, this shift pair can be optimized out by RShiftI's identity [1]. 
The code is optimized to: > > movi v16.16b, #0x1 > smov x0, v16.b[1] > > [TEST] > > Full jtreg passed except 4 files on x86: > > jdk/incubator/vector/Byte128VectorTests.java > jdk/incubator/vector/Byte256VectorTests.java > jdk/incubator/vector/Byte512VectorTests.java > jdk/incubator/vector/Byte64VectorTests.java > > They are caused by a known issue on x86 [2]. > > [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 > [2] https://bugs.openjdk.org/browse/JDK-8303508 Marked as reviewed by qamai (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/13070 From amitkumar at openjdk.org Fri Mar 17 08:13:13 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 17 Mar 2023 08:13:13 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: <7cD60PcBEew1R4Q9Q1fF4plcsdw5StvAXF_jBYX2eYU=.8b0ced92-cf91-4873-8412-ab8a3c84d8a6@github.com> On Mon, 6 Mar 2023 05:58:23 GMT, Amit Kumar wrote: >> This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. > > Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: > > use constant instead of enum Hi, could somebody please review it, So that we can move ahead ? ------------- PR: https://git.openjdk.org/jdk/pull/12825 From epeter at openjdk.org Fri Mar 17 08:38:20 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 17 Mar 2023 08:38:20 GMT Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible Message-ID: I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand. Some of them seem to be justified. There I added comments why they are justified. They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter. For the following bailouts I did not add an assert, because it may have revealed a bug: [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA Note: [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj That bug bug was the reason for this RFE here. I added the assert for "malformed control flow". After this RFE here, that Bug will run into the assert on debug builds. I ran `tier1-6` and stress testing. None of the asserts triggered. Should we file a follow-up RFE to do the same for `BAILOUT` in `C1`? 
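For reference, the pattern this change adds is roughly the following (a minimal sketch only, using the "malformed control flow" bailout mentioned above; the exact call sites, guards and messages are those in the patch, not this illustration):

// Before: bail out of the compilation silently.
//   record_method_not_compilable("malformed control flow");

// After: in debug builds, fail loudly first so the condition gets noticed,
// then keep the bailout for product builds.
assert(false, "malformed control flow");
record_method_not_compilable("malformed control flow");

Bailouts that are considered justified keep only an explanatory comment and do not get the assert.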
------------- Commit messages: - Merge branch 'master' into JDK-8303951 - Out of stack space - expected bailout - moving spill-split-recycle bailout to future work - manual merge after NULL nullptr - 8303951: Add asserts before record_method_not_compilable where possible Changes: https://git.openjdk.org/jdk/pull/13038/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13038&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303951 Stats: 46 lines in 10 files changed: 42 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/13038.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13038/head:pull/13038 PR: https://git.openjdk.org/jdk/pull/13038 From duke at openjdk.org Fri Mar 17 08:40:32 2023 From: duke at openjdk.org (SUN Guoyun) Date: Fri, 17 Mar 2023 08:40:32 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 05:58:23 GMT, Amit Kumar wrote: >> This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. > > Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: > > use constant instead of enum I don't think the current modification is reasonable, why don't you modify `emit_typecheck_helper`? Maybe we can also @dean-long has to say about this. ------------- PR: https://git.openjdk.org/jdk/pull/12825 From rrich at openjdk.org Fri Mar 17 08:48:32 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 17 Mar 2023 08:48:32 GMT Subject: Integrated: 8299375: [PPC64] GetStackTraceSuspendedStressTest tries to deoptimize frame with invalid fp In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 10:46:05 GMT, Richard Reingruber wrote: > Mark a frame as not fully initialized when copying it from a continuation StackChunk to the stack until the callers_sp (aka back link) is set. > > This avoids the assertion given in the bug report when the copied frame is deoptimized before it is fully initialized. > IMHO the deoptimization at that point is a little questionable but it actually only changes the pc of the frame which can be done. > Note that the frame can get extended later (and metadata can get overridden) but [there is code that handles this](https://github.com/openjdk/jdk/blob/34a92466a615415b76c8cb6010ff7e6e1a1d63b4/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L2108-L2110). > > Testing: jdk_loom. The fix passed our CI testing. This includes most JCK and JTREG tiers 1-4, also in Xcomp mode, on the standard platforms and also on ppc64le. This pull request has now been integrated. Changeset: 9d518c52 Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/9d518c528b11953b556aa7585fc69ff9c9a22435 Stats: 18 lines in 3 files changed: 14 ins; 0 del; 4 mod 8299375: [PPC64] GetStackTraceSuspendedStressTest tries to deoptimize frame with invalid fp Reviewed-by: mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/12941 From dnsimon at openjdk.org Fri Mar 17 09:09:20 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 17 Mar 2023 09:09:20 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. 
[v3] In-Reply-To: References: Message-ID: On Fri, 17 Mar 2023 00:05:59 GMT, Vladimir Kozlov wrote: >> Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: >> >> avoid duplicated entry. > > Yes, it would do what you are saying but I don't like infinite loops like that with conditional exists. > How big length of `failed_speculations_address` list you can have? > Consider implementing it as recursive method if depth is not big: > > bool FailedSpeculation::add_failed_speculation(nmethod* nm, FailedSpeculation** failed_speculations_address, address speculation, int speculation_len) { > assert(failed_speculations_address != nullptr, "must be"); > guarantee_failed_speculations_alive(nm, failed_speculations_address); > > size_t fs_size = sizeof(FailedSpeculation) + speculation_len; > FailedSpeculation* fs = new (fs_size) FailedSpeculation(speculation, speculation_len); > if (fs == nullptr) { > // no memory -> ignore failed speculation > return false; > } > guarantee(is_aligned(fs, sizeof(FailedSpeculation*)), "FailedSpeculation objects must be pointer aligned"); > > if (!add_failed_speculation_recursive(failed_speculations_address, fs)) { > delete fs; > return false; > } > return true; > } > > bool add_failed_speculation_recursive(FailedSpeculation** cursor, FailedSpeculation* fs) { > if (*cursor == nullptr) { > FailedSpeculation* old_fs = Atomic::cmpxchg(cursor, (FailedSpeculation*) nullptr, fs); > if (old_fs == nullptr) { > // Successfully appended fs to end of the list > return true; > } > guarantee(*cursor != nullptr, "cursor must point to non-null FailedSpeculation"); > } > // check if the current entry matches this thread's failed speculation > int speculation_len = fs->data_len(); > if ((*cursor)->data_len() == speculation_len && memcmp(fs->data(), (*cursor)->data(), speculation_len) == 0) { > return false; > } > return add_failed_speculation_recursive((*cursor)->next_adr(), fs); > } @vnkozlov your suggestion eagerly allocates a new `FailedSpeculation`. I'm also generally allergic to infinite loops but I don't want to ever have to worry about a stack overflow in this code as it will crash the VM. I think we should leave Yudi's code in its current form. ------------- PR: https://git.openjdk.org/jdk/pull/13022 From amitkumar at openjdk.org Fri Mar 17 11:34:03 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 17 Mar 2023 11:34:03 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 05:58:23 GMT, Amit Kumar wrote: >> This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. > > Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: > > use constant instead of enum @RealLucy would you please take a look at this one too. 
Thank you :-) ------------- PR: https://git.openjdk.org/jdk/pull/12825 From duke at openjdk.org Fri Mar 17 13:07:44 2023 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 17 Mar 2023 13:07:44 GMT Subject: RFR: JDK-8303069: Memory leak in CompilerOracle::parse_from_line In-Reply-To: References: Message-ID: <7K3FyFq9FuQZKFnfHFpqu4QR7UOfnjrt0EgVm60rndw=.6ce5c9f9-fb98-406f-9e6a-18f176c35fe8@github.com> On Thu, 16 Mar 2023 15:03:52 GMT, Tobias Hartmann wrote: >> A memory leak has been detected using *lsan* when running the `compiler/blackhole/BlackholeExperimentalUnlockTest.java` test. >> >> This happens when parsing the *blackhole* *CompileCommand*. There is a check for the `-XX:+UnlockExperimentalVMOptions` flag being enabled when *blackhole* is used. If this flag is not set, a warning gets printed and the *CompileCommand* is not taken. >> >> Unfortunately this happens in `register_commands` where commands are supposed to be added to a list. In this case we bail out and the option is neither added nor deleted. >> >> https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L310-L313 >> >> So, this changes deletes the `matcher` object before returning. This is done in `register_commands` (the callee) mainly because a couple of other similar checks are also done in this function. >> >> The other option would have been to move the check to the caller (the only one for this case) before `register_command`: >> https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L915-L917 >> but it seemed less appropriate (no other similar checks). >> >> Letting it handle like other experimental flags (code below) is not an option either since in this case we have to do with a `CompileCommand`. >> https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/runtime/flags/jvmFlag.cpp#L112-L118 > > Looks good to me. Thanks a lot @TobiHartmann @jcking for your reviews! ------------- PR: https://git.openjdk.org/jdk/pull/13060 From duke at openjdk.org Fri Mar 17 13:28:34 2023 From: duke at openjdk.org (Damon Fenacci) Date: Fri, 17 Mar 2023 13:28:34 GMT Subject: Integrated: JDK-8303069: Memory leak in CompilerOracle::parse_from_line In-Reply-To: References: Message-ID: On Thu, 16 Mar 2023 14:08:13 GMT, Damon Fenacci wrote: > A memory leak has been detected using *lsan* when running the `compiler/blackhole/BlackholeExperimentalUnlockTest.java` test. > > This happens when parsing the *blackhole* *CompileCommand*. There is a check for the `-XX:+UnlockExperimentalVMOptions` flag being enabled when *blackhole* is used. If this flag is not set, a warning gets printed and the *CompileCommand* is not taken. > > Unfortunately this happens in `register_commands` where commands are supposed to be added to a list. In this case we bail out and the option is neither added nor deleted. > > https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L310-L313 > > So, this changes deletes the `matcher` object before returning. This is done in `register_commands` (the callee) mainly because a couple of other similar checks are also done in this function. 
> > The other option would have been to move the check to the caller (the only one for this case) before `register_command`: > https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/compiler/compilerOracle.cpp#L915-L917 > but it seemed less appropriate (no other similar checks). > > Letting it handle like other experimental flags (code below) is not an option either since in this case we have to do with a `CompileCommand`. > https://github.com/openjdk/jdk/blob/f629152021d4ce0288119c47d5a111b87dce1de6/src/hotspot/share/runtime/flags/jvmFlag.cpp#L112-L118 This pull request has now been integrated. Changeset: 384a8b85 Author: Damon Fenacci Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/384a8b85a7266b920242ea73baf578577ca588ec Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8303069: Memory leak in CompilerOracle::parse_from_line Reviewed-by: thartmann, jcking ------------- PR: https://git.openjdk.org/jdk/pull/13060 From thartmann at openjdk.org Fri Mar 17 15:03:20 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 17 Mar 2023 15:03:20 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops [v2] In-Reply-To: References: Message-ID: <56Dqr28HmjSXNzanFtuGBIQqnGrDYA7vU_MXhum-4WQ=.44095e43-2ae2-48a5-a0fa-6c22272b0aa6@github.com> On Tue, 14 Mar 2023 10:58:11 GMT, Roland Westrelin wrote: >> In the test case `testByteLong1` (that's extracted from a memory >> segment micro benchmark), the address of the store is initially: >> >> >> (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) >> >> >> (#numbers are node numbers to help the discussion). >> >> `iv#101` is the `Phi` of a counted loop. `invar#163` is the >> `baseOffset` load. >> >> To eliminate the range check, the loop is transformed into a loop nest >> and as a consequence the address above becomes: >> >> >> (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) >> >> >> `invar#308` is some expression from a `Phi` of the outer loop. >> >> That `AddP` is transformed multiple times to push the invariants out of loop: >> >> >> (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) >> >> >> then: >> >> >> (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) >> >> >> and finally: >> >> >> (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) >> >> >> `AddP#855` is out of the inner loop. >> >> This doesn't vectorize because: >> >> - there are 2 invariants in the address expression but superword only >> support one (tracked by `_invar` in `SWPointer`) >> >> - there are more levels of `AddP` (4) than superword supports (3) >> >> To fix that, I propose to no longer track the address elements in >> `_invar`, `_negate_invar` and `_invar_scale` but instead to have a >> single `_invar` which is an expression built by superword as it >> follows chains of `addP` nodes. I kept the previous `_invar`, >> `_negate_invar` and `_invar_scale` as debugging and use them to check >> that what vectorized with the previous scheme still does. >> >> I also propose lifting the restriction on 3 levels of `AddP` entirely. 
> > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - NULL -> nullptr > - Merge branch 'master' into JDK-8300257 > - fix & test Performance and correctness testing looks good. ------------- PR: https://git.openjdk.org/jdk/pull/12942 From cslucas at openjdk.org Fri Mar 17 15:52:15 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 17 Mar 2023 15:52:15 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If [v3] In-Reply-To: References: Message-ID: <6VubTXyFbgqR-itTDABGv9CWVWsym52M3snRNxOotoc=.a91c0bc7-13f5-48c6-b13e-ad8bee448366@github.com> On Tue, 14 Mar 2023 12:43:19 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? It adds new Identity for BoolNode to lookup homogenous integer comparison, i.e. `Bool (CmpX a b)` is identity to `Bool (CmpX b a)`, in this way, we are able to merge the following two "identical" Ifs, which is not before. >> >> >> public static void test(int a, int b) { // ok, identical ifs, apply split_if >> if (a == b) { >> int_field = 0x42; >> } else { >> int_field = 42; >> } >> if (a == b) { >> int_field = 0x42; >> } else { >> int_field = 42; >> } >> } >> >> public static void test(int a, int b) { // do nothing >> if (a == b) { >> int_field = 0x42; >> } else { >> int_field = 42; >> } >> if (b == a) { >> int_field = 0x42; >> } else { >> int_field = 42; >> } >> } >> >> >> Testing: tier1, appllication/ctw/modules > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > review from tobias src/hotspot/share/opto/subnode.cpp line 1488: > 1486: > 1487: static bool is_arithmetic_cmp(Node* cmp) { > 1488: if (!cmp->is_Cmp()) { Perhaps just merge this if with the next one. src/hotspot/share/opto/subnode.cpp line 1514: > 1512: Node* reverse_cmp = NULL; > 1513: if ((_test._test == BoolTest::eq || _test._test == BoolTest::ne) && > 1514: (reverse_cmp = cmp->as_Cmp()->get_reverse_cmp()) != nullptr) { Please factor out the assignment to reverse_cmp. src/hotspot/share/opto/subnode.cpp line 1539: > 1537: Node *cmp1 = cmp->in(1); > 1538: Node *cmp2 = cmp->in(2); > 1539: if (!cmp1) { Please change to "if (cmp1 == nullptr)". test/hotspot/jtreg/compiler/c2/irTests/TestBackToBackIfs.java line 91: > 89: } > 90: > 91: @Run(test = {"test", "test1", "test2"}) Can you please add some tests with more than two "if-elses"? I think having a test with a few consecutive "if-else" with alternating operands (e.g., "a != b", "b != a", "a != b", ...) may be interesting. ------------- PR: https://git.openjdk.org/jdk/pull/12978 From kvn at openjdk.org Fri Mar 17 16:09:51 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 17 Mar 2023 16:09:51 GMT Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 10:29:32 GMT, Emanuel Peter wrote: > I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand. > > Some of them seem to be justified. There I added comments why they are justified. They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter. 
> > For the following bailouts I did not add an assert, because it may have revealed a bug: > [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA > > Note: > [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj > That bug bug was the reason for this RFE here. I added the assert for "malformed control flow". After this RFE here, that Bug will run into the assert on debug builds. > > I ran `tier1-6` and stress testing. Now running `tier7-9`. > > Should we file a follow-up RFE to do the same for `BAILOUT` in `C1`? General suggestion is to print more info on bailouts to help debugging. > Should we file a follow-up RFE to do the same for BAILOUT in C1? Yes src/hotspot/share/compiler/compileBroker.cpp line 2280: > 2278: > 2279: if (!ci_env.failing() && !task->is_success()) { > 2280: assert(false, "compiler should always document failure"); I suggest to add `ci_env.failure_reason()` to assert message. src/hotspot/share/opto/buildOopMap.cpp line 254: > 252: // Check for a legal reg name in the oopMap and bailout if it is not. > 253: if (!omap->legal_vm_reg_name(r)) { > 254: assert(false, "illegal oopMap register name"); Added information about `r` to assert message. src/hotspot/share/opto/buildOopMap.cpp line 322: > 320: // Check for a legal reg name in the oopMap and bailout if it is not. > 321: if (!omap->legal_vm_reg_name(r)) { > 322: assert(false, "illegal oopMap register name"); Same. src/hotspot/share/opto/compile.cpp line 757: > 755: if (cg == nullptr) { > 756: const char* reason = InlineTree::check_can_parse(method()); > 757: assert(reason != nullptr, "cannot parse method: why?"); Add `reason` to assert's message. src/hotspot/share/opto/compile.cpp line 769: > 767: if ((jvms = cg->generate(jvms)) == nullptr) { > 768: if (!failure_reason_is(C2Compiler::retry_class_loading_during_parsing())) { > 769: assert(failure_reason() != nullptr, "method parse failed: why?"); same src/hotspot/share/opto/compile.cpp line 4001: > 3999: // Recheck with a better notion of 'required_outcnt' > 4000: if (n->outcnt() != required_outcnt) { > 4001: assert(false, "malformed control flow"); Print more info about `n` src/hotspot/share/opto/compile.cpp line 4020: > 4018: for (DUIterator_Fast jmax, j = n->fast_outs(jmax); j < jmax; j++) > 4019: if (!frc._visited.test(n->fast_out(j)->_idx)) { > 4020: assert(false, "infinite loop"); Print more info about n src/hotspot/share/opto/output.cpp line 2847: > 2845: } > 2846: if (pinch->_idx >= _regalloc->node_regs_max_index()) { > 2847: assert(false, "too many D-U pinch points"); More info about `pinch` node. ------------- PR: https://git.openjdk.org/jdk/pull/13038 From kvn at openjdk.org Fri Mar 17 16:11:44 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 17 Mar 2023 16:11:44 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v3] In-Reply-To: References: Message-ID: On Thu, 16 Mar 2023 21:19:53 GMT, Yudi Zheng wrote: >> Upon uncommon_trap, JVMCI runtime appends a FailedSpeculation entry to the nmethod using an [atomic operation](https://github.com/openjdk/jdk/blob/55aa122462c34d8f4cafa58f4d1f2d900449c83e/src/hotspot/share/oops/methodData.cpp#L852). It becomes a performance bottleneck when there is a large amount of (virtual) threads deoptimizing in the nmethod. In this PR, we test if a FailedSpeculation exists in the list before appending it. 
> > Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: > > avoid duplicated entry. Marked as reviewed by kvn (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/13022 From kvn at openjdk.org Fri Mar 17 16:11:46 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 17 Mar 2023 16:11:46 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v3] In-Reply-To: References: Message-ID: On Fri, 17 Mar 2023 00:05:59 GMT, Vladimir Kozlov wrote: >> Yudi Zheng has updated the pull request incrementally with one additional commit since the last revision: >> >> avoid duplicated entry. > > Yes, it would do what you are saying but I don't like infinite loops like that with conditional exists. > How big length of `failed_speculations_address` list you can have? > Consider implementing it as recursive method if depth is not big: > > bool FailedSpeculation::add_failed_speculation(nmethod* nm, FailedSpeculation** failed_speculations_address, address speculation, int speculation_len) { > assert(failed_speculations_address != nullptr, "must be"); > guarantee_failed_speculations_alive(nm, failed_speculations_address); > > size_t fs_size = sizeof(FailedSpeculation) + speculation_len; > FailedSpeculation* fs = new (fs_size) FailedSpeculation(speculation, speculation_len); > if (fs == nullptr) { > // no memory -> ignore failed speculation > return false; > } > guarantee(is_aligned(fs, sizeof(FailedSpeculation*)), "FailedSpeculation objects must be pointer aligned"); > > if (!add_failed_speculation_recursive(failed_speculations_address, fs)) { > delete fs; > return false; > } > return true; > } > > bool add_failed_speculation_recursive(FailedSpeculation** cursor, FailedSpeculation* fs) { > if (*cursor == nullptr) { > FailedSpeculation* old_fs = Atomic::cmpxchg(cursor, (FailedSpeculation*) nullptr, fs); > if (old_fs == nullptr) { > // Successfully appended fs to end of the list > return true; > } > guarantee(*cursor != nullptr, "cursor must point to non-null FailedSpeculation"); > } > // check if the current entry matches this thread's failed speculation > int speculation_len = fs->data_len(); > if ((*cursor)->data_len() == speculation_len && memcmp(fs->data(), (*cursor)->data(), speculation_len) == 0) { > return false; > } > return add_failed_speculation_recursive((*cursor)->next_adr(), fs); > } > @vnkozlov your suggestion eagerly allocates a new `FailedSpeculation`. I'm also generally allergic to infinite loops but I don't want to ever have to worry about a stack overflow in this code as it will crash the VM. I think we should leave Yudi's code in its current form. Okay. You are the "boss" for this code ;^) ------------- PR: https://git.openjdk.org/jdk/pull/13022 From roland at openjdk.org Fri Mar 17 16:48:00 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 17 Mar 2023 16:48:00 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops [v3] In-Reply-To: References: Message-ID: > In the test case `testByteLong1` (that's extracted from a memory > segment micro benchmark), the address of the store is initially: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) > > > (#numbers are node numbers to help the discussion). > > `iv#101` is the `Phi` of a counted loop. `invar#163` is the > `baseOffset` load. 
> > To eliminate the range check, the loop is transformed into a loop nest > and as a consequence the address above becomes: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) > > > `invar#308` is some expression from a `Phi` of the outer loop. > > That `AddP` is transformed multiple times to push the invariants out of loop: > > > (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) > > > then: > > > (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) > > > and finally: > > > (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) > > > `AddP#855` is out of the inner loop. > > This doesn't vectorize because: > > - there are 2 invariants in the address expression but superword only > support one (tracked by `_invar` in `SWPointer`) > > - there are more levels of `AddP` (4) than superword supports (3) > > To fix that, I propose to no longer track the address elements in > `_invar`, `_negate_invar` and `_invar_scale` but instead to have a > single `_invar` which is an expression built by superword as it > follows chains of `addP` nodes. I kept the previous `_invar`, > `_negate_invar` and `_invar_scale` as debugging and use them to check > that what vectorized with the previous scheme still does. > > I also propose lifting the restriction on 3 levels of `AddP` entirely. Roland Westrelin has updated the pull request incrementally with three additional commits since the last revision: - Update test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java Co-authored-by: Tobias Hartmann - Update test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java Co-authored-by: Tobias Hartmann - Update src/hotspot/share/opto/superword.hpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12942/files - new: https://git.openjdk.org/jdk/pull/12942/files/cdcc181c..d4d07656 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12942&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12942&range=01-02 Stats: 13 lines in 2 files changed: 0 ins; 2 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/12942.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12942/head:pull/12942 PR: https://git.openjdk.org/jdk/pull/12942 From dlong at openjdk.org Fri Mar 17 20:19:32 2023 From: dlong at openjdk.org (Dean Long) Date: Fri, 17 Mar 2023 20:19:32 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 05:58:23 GMT, Amit Kumar wrote: >> This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. > > Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: > > use constant instead of enum It looks like s390 implements emit_typecheck_helper in a different way than other ports. Only s390 uses store_parameter() to write into the compiler frame slots reserved for stubs. Other ports push the values and adjust the stack pointer. I suggest s390 follow the example of other ports. 
Either pass args in registers or push to temporary stack space. ------------- PR: https://git.openjdk.org/jdk/pull/12825 From dlong at openjdk.org Fri Mar 17 20:22:20 2023 From: dlong at openjdk.org (Dean Long) Date: Fri, 17 Mar 2023 20:22:20 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Mon, 6 Mar 2023 05:58:23 GMT, Amit Kumar wrote: >> This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. > > Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: > > use constant instead of enum This seems like the wrong approach, and reverts parts of JDK-8302369. ------------- Changes requested by dlong (Reviewer). PR: https://git.openjdk.org/jdk/pull/12825 From jlu at openjdk.org Fri Mar 17 20:28:13 2023 From: jlu at openjdk.org (Justin Lu) Date: Fri, 17 Mar 2023 20:28:13 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v4] In-Reply-To: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: > This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. > > In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. Justin Lu has updated the pull request incrementally with one additional commit since the last revision: Adjust CF test to read in with UTF-8 to fix failing test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12726/files - new: https://git.openjdk.org/jdk/pull/12726/files/7119830b..007c78a7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12726&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12726&range=02-03 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12726.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12726/head:pull/12726 PR: https://git.openjdk.org/jdk/pull/12726 From angorya at openjdk.org Fri Mar 17 20:34:00 2023 From: angorya at openjdk.org (Andy Goryachev) Date: Fri, 17 Mar 2023 20:34:00 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v4] In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: <-3wtWK_Pdt1fqDnSjbS6JTGLwboJi7Tw2sV0v7LQ3Os=.7036d0b0-2524-43bc-a82d-640f29fd35a0@github.com> On Fri, 17 Mar 2023 20:28:13 GMT, Justin Lu wrote: >> This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. >> >> In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. 
> > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Adjust CF test to read in with UTF-8 to fix failing test make/jdk/src/classes/build/tools/compileproperties/CompileProperties.java line 226: > 224: Properties p = new Properties(); > 225: try { > 226: FileInputStream input = new FileInputStream(propertiesPath); Should this stream be closed in a finally { } block? ------------- PR: https://git.openjdk.org/jdk/pull/12726 From naoto at openjdk.org Fri Mar 17 21:05:18 2023 From: naoto at openjdk.org (Naoto Sato) Date: Fri, 17 Mar 2023 21:05:18 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v4] In-Reply-To: <-3wtWK_Pdt1fqDnSjbS6JTGLwboJi7Tw2sV0v7LQ3Os=.7036d0b0-2524-43bc-a82d-640f29fd35a0@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> <-3wtWK_Pdt1fqDnSjbS6JTGLwboJi7Tw2sV0v7LQ3Os=.7036d0b0-2524-43bc-a82d-640f29fd35a0@github.com> Message-ID: On Fri, 17 Mar 2023 20:31:27 GMT, Andy Goryachev wrote: >> Justin Lu has updated the pull request incrementally with one additional commit since the last revision: >> >> Adjust CF test to read in with UTF-8 to fix failing test > > make/jdk/src/classes/build/tools/compileproperties/CompileProperties.java line 226: > >> 224: Properties p = new Properties(); >> 225: try { >> 226: FileInputStream input = new FileInputStream(propertiesPath); > > Should this stream be closed in a finally { } block? or better be `try-with-resources`? ------------- PR: https://git.openjdk.org/jdk/pull/12726 From weijun at openjdk.org Fri Mar 17 21:52:23 2023 From: weijun at openjdk.org (Weijun Wang) Date: Fri, 17 Mar 2023 21:52:23 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v4] In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: <1JBZe7nrM-HsVEItfK-3GPeXoX_glyM9SL4ZACUbLwk=.3a3cf62b-0960-4b03-80aa-2756bd1636dc@github.com> On Fri, 17 Mar 2023 20:28:13 GMT, Justin Lu wrote: >> This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. >> >> In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. > > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Adjust CF test to read in with UTF-8 to fix failing test make/jdk/src/classes/build/tools/compileproperties/CompileProperties.java line 326: > 324: outBuffer.append(toHex((aChar >> 8) & 0xF)); > 325: outBuffer.append(toHex((aChar >> 4) & 0xF)); > 326: outBuffer.append(toHex(aChar & 0xF)); Sorry I don't know when this tool is called, but why is it still writing in `\unnnn` style? 
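One possible shape for the stream handling discussed at line 226 above is a try-with-resources block that also reads the file as UTF-8. This is only a sketch of the idea (the class and method names here are made up for illustration), not necessarily the code that ends up in the PR:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class LoadUtf8Properties {
    // Read a .properties file as UTF-8; try-with-resources closes the stream
    // even if load() throws. propertiesPath mirrors the variable name quoted
    // above, but this helper itself is purely illustrative.
    static Properties load(String propertiesPath) throws IOException {
        Properties p = new Properties();
        try (Reader reader = new InputStreamReader(
                new FileInputStream(propertiesPath), StandardCharsets.UTF_8)) {
            p.load(reader);
        }
        return p;
    }
}

Usage would then be along the lines of Properties p = LoadUtf8Properties.load(propertiesPath);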
------------- PR: https://git.openjdk.org/jdk/pull/12726 From weijun at openjdk.org Fri Mar 17 21:56:23 2023 From: weijun at openjdk.org (Weijun Wang) Date: Fri, 17 Mar 2023 21:56:23 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v4] In-Reply-To: <1JBZe7nrM-HsVEItfK-3GPeXoX_glyM9SL4ZACUbLwk=.3a3cf62b-0960-4b03-80aa-2756bd1636dc@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> <1JBZe7nrM-HsVEItfK-3GPeXoX_glyM9SL4ZACUbLwk=.3a3cf62b-0960-4b03-80aa-2756bd1636dc@github.com> Message-ID: On Fri, 17 Mar 2023 21:49:33 GMT, Weijun Wang wrote: >> Justin Lu has updated the pull request incrementally with one additional commit since the last revision: >> >> Adjust CF test to read in with UTF-8 to fix failing test > > make/jdk/src/classes/build/tools/compileproperties/CompileProperties.java line 326: > >> 324: outBuffer.append(toHex((aChar >> 8) & 0xF)); >> 325: outBuffer.append(toHex((aChar >> 4) & 0xF)); >> 326: outBuffer.append(toHex(aChar & 0xF)); > > Sorry I don't know when this tool is called, but why is it still writing in `\unnnn` style? I probably understand it now, source code still needs escaping. When can we put in UTF-8 there as well? ------------- PR: https://git.openjdk.org/jdk/pull/12726 From jlu at openjdk.org Fri Mar 17 22:27:48 2023 From: jlu at openjdk.org (Justin Lu) Date: Fri, 17 Mar 2023 22:27:48 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v5] In-Reply-To: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: > This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. > > In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. Justin Lu has updated the pull request incrementally with one additional commit since the last revision: Close streams when finished loading into props ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12726/files - new: https://git.openjdk.org/jdk/pull/12726/files/007c78a7..19b91e6b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12726&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12726&range=03-04 Stats: 15 lines in 3 files changed: 6 ins; 1 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/12726.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12726/head:pull/12726 PR: https://git.openjdk.org/jdk/pull/12726 From yzheng at openjdk.org Sat Mar 18 08:45:20 2023 From: yzheng at openjdk.org (Yudi Zheng) Date: Sat, 18 Mar 2023 08:45:20 GMT Subject: RFR: 8304138: [JVMCI] Test FailedSpeculation existence before appending. [v3] In-Reply-To: References: Message-ID: <4rPtzADl60qXsziwZEyEKmNLUCKiNidwzgb9Nuo-R3o=.4e9dcd8e-454b-47f0-8221-eb8d8ebcbfb4@github.com> On Fri, 17 Mar 2023 16:08:38 GMT, Vladimir Kozlov wrote: >> Yes, it would do what you are saying but I don't like infinite loops like that with conditional exists. >> How big length of `failed_speculations_address` list you can have? 
>> Consider implementing it as recursive method if depth is not big: >> >> bool FailedSpeculation::add_failed_speculation(nmethod* nm, FailedSpeculation** failed_speculations_address, address speculation, int speculation_len) { >> assert(failed_speculations_address != nullptr, "must be"); >> guarantee_failed_speculations_alive(nm, failed_speculations_address); >> >> size_t fs_size = sizeof(FailedSpeculation) + speculation_len; >> FailedSpeculation* fs = new (fs_size) FailedSpeculation(speculation, speculation_len); >> if (fs == nullptr) { >> // no memory -> ignore failed speculation >> return false; >> } >> guarantee(is_aligned(fs, sizeof(FailedSpeculation*)), "FailedSpeculation objects must be pointer aligned"); >> >> if (!add_failed_speculation_recursive(failed_speculations_address, fs)) { >> delete fs; >> return false; >> } >> return true; >> } >> >> bool add_failed_speculation_recursive(FailedSpeculation** cursor, FailedSpeculation* fs) { >> if (*cursor == nullptr) { >> FailedSpeculation* old_fs = Atomic::cmpxchg(cursor, (FailedSpeculation*) nullptr, fs); >> if (old_fs == nullptr) { >> // Successfully appended fs to end of the list >> return true; >> } >> guarantee(*cursor != nullptr, "cursor must point to non-null FailedSpeculation"); >> } >> // check if the current entry matches this thread's failed speculation >> int speculation_len = fs->data_len(); >> if ((*cursor)->data_len() == speculation_len && memcmp(fs->data(), (*cursor)->data(), speculation_len) == 0) { >> return false; >> } >> return add_failed_speculation_recursive((*cursor)->next_adr(), fs); >> } > >> @vnkozlov your suggestion eagerly allocates a new `FailedSpeculation`. I'm also generally allergic to infinite loops but I don't want to ever have to worry about a stack overflow in this code as it will crash the VM. I think we should leave Yudi's code in its current form. > > Okay. You are the "boss" for this code ;^) @vnkozlov @dougxc thanks for the review! I will keep it as is then. ------------- PR: https://git.openjdk.org/jdk/pull/13022 From yzheng at openjdk.org Sat Mar 18 09:45:31 2023 From: yzheng at openjdk.org (Yudi Zheng) Date: Sat, 18 Mar 2023 09:45:31 GMT Subject: Integrated: 8304138: [JVMCI] Test FailedSpeculation existence before appending. In-Reply-To: References: Message-ID: On Tue, 14 Mar 2023 14:55:18 GMT, Yudi Zheng wrote: > Upon uncommon_trap, JVMCI runtime appends a FailedSpeculation entry to the nmethod using an [atomic operation](https://github.com/openjdk/jdk/blob/55aa122462c34d8f4cafa58f4d1f2d900449c83e/src/hotspot/share/oops/methodData.cpp#L852). It becomes a performance bottleneck when there is a large amount of (virtual) threads deoptimizing in the nmethod. In this PR, we test if a FailedSpeculation exists in the list before appending it. This pull request has now been integrated. Changeset: 7503ecc0 Author: Yudi Zheng Committer: Doug Simon URL: https://git.openjdk.org/jdk/commit/7503ecc0f185f6da777c022a66d7af6c40dcd05f Stats: 28 lines in 1 file changed: 19 ins; 9 del; 0 mod 8304138: [JVMCI] Test FailedSpeculation existence before appending. 
Reviewed-by: kvn, dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/13022 From xgong at openjdk.org Mon Mar 20 01:58:20 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Mar 2023 01:58:20 GMT Subject: RFR: 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE In-Reply-To: References: Message-ID: <0Q20PKClcSnCMu3E9mKle9O3ZkBm7HPmFQoq3SVPMc4=.c48d3f66-3a01-499b-a206-d63796393749@github.com> On Tue, 7 Mar 2023 10:56:40 GMT, Bhavana Kilambi wrote: > The cast operation for VectorMask from wider type to narrow type returns incorrect result for trueCount() method invocation for the resultant mask with SVE (on some SVE machines toLong() also results in incorrect values). An example narrow operation which results in incorrect toLong() and trueCount() values is shown below for a 128-bit -> 64-bit conversion and this can be extended to other narrow operations where the source mask in bytes is either 4x or 8x the size of the result mask in bytes - > > > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static long narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).toLong(); > } > > public static void main(String[] args) { > long r = 0L; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("toLong() : " + r); > } > } > > > **C2 compilation result :** > java --add-modules jdk.incubator.vector TestMaskCast > toLong(): 15 > > **Interpreter result (for verification) :** > java --add-modules jdk.incubator.vector -Xint TestMaskCast > toLong(): 3 > > The incorrect results with toLong() have been observed only on the 128-bit and 256-bit SVE machines but they are not reproducible on a 512-bit machine. However, trueCount() returns incorrect values too and they are reproducible on all the SVE machines and thus is more reliable to use trueCount() to bring out the drawbacks of the current implementation of mask cast narrow operation for SVE. > > Replacing the call to toLong() by trueCount() in the above example - > > > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static int narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).trueCount(); > } > > public static void main(String[] args) { > int r = 0; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("trueCount() : " + r); > } > } > > > > **C2 compilation result:** > java --add-modules jdk.incubator.vector TestMaskCast > trueCount() : 4 > > **Interpreter result:** > java --add-modules jdk.incubator.vector -Xint TestMaskCast > trueCount() : 2 > > Since in this example, the source mask size in bytes is 2x that of the result mask, trueCount() returns 2x the number of true elements in the source mask. It would return 4x/8x the number of true elements in the source mask if the size of the source mask is 4x/8x that of result mask. > > The returned values are incorrect because of the higher order bits in the result not being cleared (since the result is narrowed down) and trueCount() or toLong() tend to consider the higher order bits in the vector register as well which results in incorrect value. 
For the 128-bit to 64-bit conversion with a mask - "TT" passed, the current implementation for mask cast narrow operation returns the same mask in the lower and upper half of the 128-bit register that is - "TTTT" which results in a long value of 15 (instead of 3 - "FFTT" for the 64-bit Integer mask) and number of true elements to be 4 (instead of 2). > > This patch proposes a fix for this problem. An already existing JTREG IR test - "test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java" has also been modified to call the trueCount() method as well since the toString() method alone cannot be used to reproduce the incorrect values in this bug. This test passes successfully on 128-bit, 256-bit and 512-bit SVE machines. Since the IR test has been changed, it has been tested successfully on other platforms like x86 and aarch64 Neon machines as well to ensure the changes have not introduced any new errors. Looks good to me! ------------- Marked as reviewed by xgong (Committer). PR: https://git.openjdk.org/jdk/pull/12901 From yyang at openjdk.org Mon Mar 20 02:49:21 2023 From: yyang at openjdk.org (Yi Yang) Date: Mon, 20 Mar 2023 02:49:21 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If [v3] In-Reply-To: <6VubTXyFbgqR-itTDABGv9CWVWsym52M3snRNxOotoc=.a91c0bc7-13f5-48c6-b13e-ad8bee448366@github.com> References: <6VubTXyFbgqR-itTDABGv9CWVWsym52M3snRNxOotoc=.a91c0bc7-13f5-48c6-b13e-ad8bee448366@github.com> Message-ID: On Fri, 17 Mar 2023 15:38:01 GMT, Cesar Soares Lucas wrote: >> Yi Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> review from tobias > > src/hotspot/share/opto/subnode.cpp line 1488: > >> 1486: >> 1487: static bool is_arithmetic_cmp(Node* cmp) { >> 1488: if (!cmp->is_Cmp()) { > > Perhaps just merge this if with the next one. Hi Cesar, I receive feedback from Roland in #13039 that suggested extending identical_backtoback_ifs instead of adding Ideal for BoolNode and CmpNode. Before making any modifications based on your review comment, I would first like to reach a consensus on whether we should add idealization for bool and cmp or extend identical_backtoback_ifs. Do you have any comments in this regard? Perhaps @TobiHartmann could also chime in. ------------- PR: https://git.openjdk.org/jdk/pull/12978 From thartmann at openjdk.org Mon Mar 20 07:28:19 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Mar 2023 07:28:19 GMT Subject: RFR: 8304049: C2 can not merge trivial Ifs due to CastII In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 10:37:03 GMT, Yi Yang wrote: > Hi can I have a review for this patch? C2 can not apply Split If for the attached trivial case. PhiNode::Ideal removes itself by unique_input but introduces a new CastII > > https://github.com/openjdk/jdk/blob/e3777b0c49abb9cc1925f4044392afadf3adef61/src/hotspot/share/opto/cfgnode.cpp#L1470-L1474 > > https://github.com/openjdk/jdk/blob/e3777b0c49abb9cc1925f4044392afadf3adef61/src/hotspot/share/opto/cfgnode.cpp#L2078-L2079 > > Therefore we have two Cmp, which is not identical for split_if. > > ![image](https://user-images.githubusercontent.com/5010047/225285449-b41dc939-1d3f-45f3-b6d6-a9b9445c2f6a.png) > (Fig1. Phi#41 is removed during ideal, create CastII#58 then) > > ![image](https://user-images.githubusercontent.com/5010047/225285493-30471f1c-97b0-452b-9218-3b5f09f09859.png) > (Fig2. 
CmpI#42 and CmpI#23 are different comparisons, they are not identical_backtoback_ifs ) > > This patch adds Cmp identity to find existing Cmp node, i.e. Cmp#42 is identity to Cmp#23 > > > public static void test5(int a, int b){ > > if( b!=0) { > int_field = 35; > } else { > int_field =222; > } > > if( b!=0) { > int_field = 35; > } else { > int_field =222; > } > } > > > > Test: tier1, application/ctw/modules I agree with Roland and would also like this to be merged into `PhaseIdealLoop::identical_backtoback_ifs`. ------------- PR: https://git.openjdk.org/jdk/pull/13039 From roland at openjdk.org Mon Mar 20 07:37:18 2023 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 20 Mar 2023 07:37:18 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops [v4] In-Reply-To: References: Message-ID: <1qWMOwI1VZiAf-DY2tRw8LqFe93CS7omKZam2Ro5fJ8=.a3c51e8d-c2b5-4d89-bfdb-3ed20beaa9d6@github.com> > In the test case `testByteLong1` (that's extracted from a memory > segment micro benchmark), the address of the store is initially: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) > > > (#numbers are node numbers to help the discussion). > > `iv#101` is the `Phi` of a counted loop. `invar#163` is the > `baseOffset` load. > > To eliminate the range check, the loop is transformed into a loop nest > and as a consequence the address above becomes: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) > > > `invar#308` is some expression from a `Phi` of the outer loop. > > That `AddP` is transformed multiple times to push the invariants out of loop: > > > (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) > > > then: > > > (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) > > > and finally: > > > (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) > > > `AddP#855` is out of the inner loop. > > This doesn't vectorize because: > > - there are 2 invariants in the address expression but superword only > support one (tracked by `_invar` in `SWPointer`) > > - there are more levels of `AddP` (4) than superword supports (3) > > To fix that, I propose to no longer track the address elements in > `_invar`, `_negate_invar` and `_invar_scale` but instead to have a > single `_invar` which is an expression built by superword as it > follows chains of `addP` nodes. I kept the previous `_invar`, > `_negate_invar` and `_invar_scale` as debugging and use them to check > that what vectorized with the previous scheme still does. > > I also propose lifting the restriction on 3 levels of `AddP` entirely. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains eight commits: - review - Merge branch 'master' into JDK-8300257 - Update test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java Co-authored-by: Tobias Hartmann - Update test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java Co-authored-by: Tobias Hartmann - Update src/hotspot/share/opto/superword.hpp Co-authored-by: Tobias Hartmann - NULL -> nullptr - Merge branch 'master' into JDK-8300257 - fix & test ------------- Changes: https://git.openjdk.org/jdk/pull/12942/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12942&range=03 Stats: 273 lines in 3 files changed: 211 ins; 23 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/12942.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12942/head:pull/12942 PR: https://git.openjdk.org/jdk/pull/12942 From thartmann at openjdk.org Mon Mar 20 07:50:20 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Mar 2023 07:50:20 GMT Subject: RFR: 8304230: LShift ideal transform assertion In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 18:24:23 GMT, Jasmine K. wrote: > Hi, > This PR aims to address the assertion on arm32 where the special-case `add1->in(2) == in(2)` check fails, and it falls through to the regular cases. I'm not quite sure how this issue can manifest as AFAIK the GVN should allow the usage of `==` to check against constants that are equal. I've changed the check from node equality to constant equality to hopefully resolve this. I unfortunately cannot reproduce the behavior on x86, nor do I have access to arm32 hardware, so I would greatly appreciate reviews and help testing this change (cc @bulasevich). Thank you all in advance. Looks good to me. > I'm not quite sure how this issue can manifest as AFAIK the GVN should allow the usage of == to check against constants that are equal It might be an intermittent state during GVN, where a non-constant node is known to have a constant type that's returned by `const_shift_count` but the node was not yet replaced by an actual constant and therefore `==` returns false. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/13049 From thartmann at openjdk.org Mon Mar 20 08:29:23 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Mar 2023 08:29:23 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> Message-ID: <01AxKBXWxHUxCSIZhNJKwuLcg2KVSx9Jva1PTBL5ZxQ=.42c59b6c-efad-4558-8a7d-190e4505049a@github.com> On Thu, 9 Mar 2023 01:19:40 GMT, Wang Haomin wrote: >> After https://bugs.openjdk.org/browse/JDK-8292289 , the base class of VectorTestNode changed from Node to CmpNode. So I add two match rule into ad file. >> >> match(If cop (VectorTest op1 op2)); >> match(Set dst (CMoveI (Binary cop (VectorTest op1 op2)) (Binary src1 src2))); >> >> First error, rule1 shouldn't generate the statement "node->_bottom_type = _leaf->bottom_type();". >> Second error, both rule1 and rule2 need to use VectorTestNode, the VectorTestNode should be cloned like CmpNode. 
> > Wang Haomin has updated the pull request incrementally with one additional commit since the last revision: > > compare the results with 0 src/hotspot/share/adlc/output_c.cpp line 3989: > 3987: if (inst->captures_bottom_type(_globalNames)) { > 3988: if (strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall")) != 0 > 3989: && strncmp("MachIf", inst->mach_base_class(_globalNames), strlen("MachIf")) != 0) { Could you please explain this change, and how it relates to JDK-8292289, in more detail? ------------- PR: https://git.openjdk.org/jdk/pull/12917 From xlinzheng at openjdk.org Mon Mar 20 09:41:19 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Mon, 20 Mar 2023 09:41:19 GMT Subject: RFR: 8304387: Fix positions of shared static stubs / trampolines Message-ID: This RFE fixes the positions of static stubs / trampolines. They should be like: [Verified Entry Point] ... ... ... [Stub Code] [Exception Handler] ... ... [Deopt Handler Code] ... ... Currently after we have the shared static stubs/trampolines in JDK-8280481 and JDK-8280152 : [Verified Entry Point] ... ... ... [Stub Code] [Exception Handler] ... ... [Deopt Handler Code] ... ... // they are presented in the Deopt range, though do not have correctness issues. For example on x86: [Verified Entry Point] ... [Stub Code] 0x00007fac68ef4908: nopl 0x0(%rax,%rax,1) ; {no_reloc} 0x00007fac68ef490d: mov $0x0,%rbx ; {static_stub} 0x00007fac68ef4917: jmpq 0x00007fac68ef4917 ; {runtime_call} 0x00007fac68ef491c: nop 0x00007fac68ef491d: mov $0x0,%rbx ; {static_stub} 0x00007fac68ef4927: jmpq 0x00007fac68ef4927 ; {runtime_call} [Exception Handler] 0x00007fac68ef492c: callq 0x00007fac703da280 ; {runtime_call handle_exception_from_callee Runtime1 stub} 0x00007fac68ef4931: mov $0x7fac885d8067,%rdi ; {external_word} 0x00007fac68ef493b: and $0xfffffffffffffff0,%rsp 0x00007fac68ef493f: callq 0x00007fac881e9900 ; {runtime_call MacroAssembler::debug64(char*, long, long*)} 0x00007fac68ef4944: hlt [Deopt Handler Code] 0x00007fac68ef4945: mov $0x7fac68ef4945,%r10 ; {section_word} 0x00007fac68ef494f: push %r10 0x00007fac68ef4951: jmpq 0x00007fac70326520 ; {runtime_call DeoptimizationBlob} 0x00007fac68ef4956: mov $0x0,%rbx ; {static_stub} // <---------- here 0x00007fac68ef4960: jmpq 0x00007fac68ef4960 ; {runtime_call} 0x00007fac68ef4965: mov $0x0,%rbx ; {static_stub} // <---------- here 0x00007fac68ef496f: jmpq 0x00007fac68ef496f ; {runtime_call} 0x00007fac68ef4974: mov $0x0,%rbx ; {static_stub} // <---------- here 0x00007fac68ef497e: jmpq 0x00007fac68ef497e ; {runtime_call} 0x00007fac68ef4983: hlt 0x00007fac68ef4984: hlt 0x00007fac68ef4985: hlt 0x00007fac68ef4986: hlt 0x00007fac68ef4987: hlt -------------------------------------------------------------------------------- [/Disassembly] It can be simply reproduced and dumped by `-XX:+PrintAssembly`. Though the correctness doesn't get affected in the current case, we may need to move them to a better place, back into the `[Stub Code]`, which might be more reasonable and unified. Also for the performance's sake, `ciEnv::register_method()`, where `code_buffer->finalize_stubs()` locates currently, has two locks `Compile_lock` and `MethodCompileQueue_lock`. So I think it may be better to move `code_buffer->finalize_stubs()` out to C1 and C2 code generation phases, separately, before the exception handler code is emitted so they are inside the `[Stub Code]` range. 
BTW, this is the "direct cause" of [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) because shared trampolines and their data are generated at the end of compiled code, which is different from the original condition. Though for that issue, the root cause is still from the Binutils, for even if trampolines are generated at the end of code, we should not fail as well when disassembling. But that is another issue, please see [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) for more details. Tested x86, AArch64, and RISC-V hotspot tier1~4 with fastdebug build, no new errors found. Thanks, Xiaolin ------------- Commit messages: - remove error handling logic in aarch64 part - My real home Changes: https://git.openjdk.org/jdk/pull/13071/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13071&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304387 Stats: 21 lines in 8 files changed: 11 ins; 7 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/13071.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13071/head:pull/13071 PR: https://git.openjdk.org/jdk/pull/13071 From chagedorn at openjdk.org Mon Mar 20 09:55:21 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 20 Mar 2023 09:55:21 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter In-Reply-To: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Thu, 9 Mar 2023 18:28:02 GMT, Roberto Casta?eda Lozano wrote: > The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: > > - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and > - "Condense graph", which makes the graph more compact without loss of information. > > Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): > > ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) > > Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: > - combining Bool and conversion nodes into their predecessors, > - inlining all Parm nodes except control into their successors (this removes lots of long edges), > - removing "top" inputs from call-like nodes, > - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, > - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and > - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). > > The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. 
This helps reducing node width, which is known to improve overall layout quality: > > ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) > > Note that the exact input indices can still be retrieved via the incoming edge's tooltips: > > ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) > > The control-flow graph view is also adapted to this representation: > > ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) > > #### Additional improvements > > Additionally, this changeset: > - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), > - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, > - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and > - defines and documents JavaScript helpers to simplify the new and existing available filters. > > Here is an example of the effect of the new "Show custom node info" filter: > > ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) > > ### Testing > > #### Functionality > > - Tested the functionality manually on a small selection of graphs. > > - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > #### Performance > > Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. > > The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). First of all, nice work! The improved filters and the additional node info for call and exception-creation nodes is very useful. Maybe the node info can be improved further in a future RFE, for example for `CountedLoop` nodes to also show if it is a pre/main/post loop or to add the stride. I've tried your patch out and it works quite well. But I've noticed some things: - When selecting a `CallStaticJava` node, the custom node info is sometimes cut depending on the zoom level (sometimes more, sometimes less): ![image](https://user-images.githubusercontent.com/17833009/226298043-a029d42f-0fe0-423e-853c-027bba3b10a8.png) - Selecting an inlined node with the condensed graph filter does not work when searching for it. 
For example, I can search for `165 Bool` node in the search field. It finds it but when clicking on it, it shows me an empty graph. I would have expected to see the following graph with the "outer" node being selected which includes `165 Bool`: ![image](https://user-images.githubusercontent.com/17833009/226300286-375b13f4-01bf-4f7d-8984-fd17031b43ed.png) - I've only just noticed this now: When having IGV opened on my second, larger monitor (ultra-wide, 3440x1440), the tooltip is quite off: ![image](https://user-images.githubusercontent.com/17833009/226302714-abd098cf-104a-4cd0-9d9a-065f1fb08b96.png) But this was already a problem before and unrelated to your patch. I'll also have a look at the code later but I'm not very familiar with it. Thanks, Christian ------------- PR: https://git.openjdk.org/jdk/pull/12955 From thartmann at openjdk.org Mon Mar 20 10:27:19 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Mar 2023 10:27:19 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB In-Reply-To: References: Message-ID: On Fri, 17 Mar 2023 06:14:00 GMT, Eric Liu wrote: > This is a trivial patch, which fixes the bottom type of ExtractB/UB nodes. > > ExtractNode can be generated by Vector API Vector.lane(int), which gets the lane element at the given index. A more precise type of range can help to optimize out unnecessary type conversion in some cases. > > Below shows a typical case used ExtractBNode > > > public static byte byteLt16() { > ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); > return vecb.lane(1); > } > > > In this case, c2 constructs IR graph like: > > ExtractB ConI(24) > | __| > | / | > LShiftI __| > | / > RShiftI > > which generates AArch64 code: > > movi v16.16b, #0x1 > smov x11, v16.b[1] > sxtb w0, w11 > > with this patch, this shift pair can be optimized out by RShiftI's identity [1]. The code is optimized to: > > movi v16.16b, #0x1 > smov x0, v16.b[1] > > [TEST] > > Full jtreg passed except 4 files on x86: > > jdk/incubator/vector/Byte128VectorTests.java > jdk/incubator/vector/Byte256VectorTests.java > jdk/incubator/vector/Byte512VectorTests.java > jdk/incubator/vector/Byte64VectorTests.java > > They are caused by a known issue on x86 [2]. > > [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 > [2] https://bugs.openjdk.org/browse/JDK-8303508 ~~Looks good to me too.~~ This triggers failures in testing: jdk/incubator/vector/Byte64VectorTests.java java.lang.Exception: failures: 1 at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:95) at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:53) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) at java.base/java.lang.reflect.Method.invoke(Method.java:578) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:125) at java.base/java.lang.Thread.run(Thread.java:1623) ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/13070 Changes requested by thartmann (Reviewer).
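As a rough, standalone illustration of the lane-extraction pattern discussed above (it is not one of the PR's jtreg tests, and the class name below is made up), the following sketch broadcasts a negative byte and reads a lane back with Vector.lane(int). The extracted value has to come back sign-extended into the byte range that the ExtractB bottom type is meant to model; whether the extra sign-extension instruction disappears from the generated code depends on the JDK build and platform, so treat this only as a sketch.

import jdk.incubator.vector.ByteVector;

public class TestByteLaneRange {

    // Extract a lane from a byte vector. At the Java level the result must
    // stay within [-128, 127], which is the range a precise ExtractB bottom
    // type would describe.
    static byte broadcastAndExtract(byte value, int lane) {
        ByteVector vec = ByteVector.broadcast(ByteVector.SPECIES_128, value);
        return vec.lane(lane);
    }

    public static void main(String[] args) {
        byte r = 0;
        // Warm up so the extraction path gets C2-compiled.
        for (int i = 0; i < 50_000; i++) {
            r = broadcastAndExtract((byte) -1, 1);
        }
        // A negative byte must come back sign-extended, not as 255.
        if (r != -1) {
            throw new AssertionError("expected -1, got " + r);
        }
        System.out.println("lane value: " + r);
    }
}

As with the examples earlier in this thread, it runs with java --add-modules jdk.incubator.vector TestByteLaneRange.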
From adinn at openjdk.org Mon Mar 20 10:38:22 2023 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 20 Mar 2023 10:38:22 GMT Subject: RFR: 8304387: Fix positions of shared static stubs / trampolines In-Reply-To: References: Message-ID: <5XQDOtLl4vpWkq6U_mTl1X4E6XwrNReLrY2mhcuo5WY=.3e7be4db-5cfb-4f43-acae-7dcaf97632fb@github.com> On Fri, 17 Mar 2023 07:30:19 GMT, Xiaolin Zheng wrote: > This RFE fixes the positions of static stubs / trampolines. They should be like: > > > [Verified Entry Point] > ... > ... > ... > [Stub Code] > > [Exception Handler] > ... > ... > [Deopt Handler Code] > ... > ... > > > Currently after we have the shared static stubs/trampolines in JDK-8280481 and JDK-8280152 : > > > [Verified Entry Point] > ... > ... > ... > [Stub Code] > > [Exception Handler] > ... > ... > [Deopt Handler Code] > ... > ... > // they are presented in the Deopt range, though do not have correctness issues. > > > For example on x86: > > > [Verified Entry Point] > ... > [Stub Code] > 0x00007fac68ef4908: nopl 0x0(%rax,%rax,1) ; {no_reloc} > 0x00007fac68ef490d: mov $0x0,%rbx ; {static_stub} > 0x00007fac68ef4917: jmpq 0x00007fac68ef4917 ; {runtime_call} > 0x00007fac68ef491c: nop > 0x00007fac68ef491d: mov $0x0,%rbx ; {static_stub} > 0x00007fac68ef4927: jmpq 0x00007fac68ef4927 ; {runtime_call} > [Exception Handler] > 0x00007fac68ef492c: callq 0x00007fac703da280 ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x00007fac68ef4931: mov $0x7fac885d8067,%rdi ; {external_word} > 0x00007fac68ef493b: and $0xfffffffffffffff0,%rsp > 0x00007fac68ef493f: callq 0x00007fac881e9900 ; {runtime_call MacroAssembler::debug64(char*, long, long*)} > 0x00007fac68ef4944: hlt > [Deopt Handler Code] > 0x00007fac68ef4945: mov $0x7fac68ef4945,%r10 ; {section_word} > 0x00007fac68ef494f: push %r10 > 0x00007fac68ef4951: jmpq 0x00007fac70326520 ; {runtime_call DeoptimizationBlob} > 0x00007fac68ef4956: mov $0x0,%rbx ; {static_stub} // <---------- here > 0x00007fac68ef4960: jmpq 0x00007fac68ef4960 ; {runtime_call} > 0x00007fac68ef4965: mov $0x0,%rbx ; {static_stub} // <---------- here > 0x00007fac68ef496f: jmpq 0x00007fac68ef496f ; {runtime_call} > 0x00007fac68ef4974: mov $0x0,%rbx ; {static_stub} // <---------- here > 0x00007fac68ef497e: jmpq 0x00007fac68ef497e ; {runtime_call} > 0x00007fac68ef4983: hlt > 0x00007fac68ef4984: hlt > 0x00007fac68ef4985: hlt > 0x00007fac68ef4986: hlt > 0x00007fac68ef4987: hlt > -------------------------------------------------------------------------------- > [/Disassembly] > > > > It can be simply reproduced and dumped by `-XX:+PrintAssembly`. > > Though the correctness doesn't get affected in the current case, we may need to move them to a better place, back into the `[Stub Code]`, which might be more reasonable and unified. Also for the performance's sake, `ciEnv::register_method()`, where `code_buffer->finalize_stubs()` locates currently, has two locks `Compile_lock` and `MethodCompileQueue_lock`. So I think it may be better to move `code_buffer->finalize_stubs()` out to C1 and C2 code generation phases, separately, before the exception handler code is emitted so they are inside the `[Stub Code]` range. > > BTW, this is the "direct cause" of [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) because shared trampolines and their data are generated at the end of compiled code, which is different from the original condition. 
Though for that issue, the root cause is still from the Binutils, for even if trampolines are generated at the end of code, we should not fail as well when disassembling. But that is another issue, please see [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) for more details. > > Tested x86, AArch64, and RISC-V hotspot tier1~4 with fastdebug build, no new errors found. > > Thanks, > Xiaolin Thanks for fixing this. src/hotspot/share/asm/codeBuffer.cpp line 1012: > 1010: } > 1011: > 1012: bool CodeBuffer::finalize_stubs() { It's not actually relevant to this fix but it might be worth making a small change here. There is actually no need to call `pd_finalize_stubs` if `_finalize_stubs` is already false. So, you could return straight away if it is already false. ------------- Marked as reviewed by adinn (Reviewer). PR: https://git.openjdk.org/jdk/pull/13071 From tholenstein at openjdk.org Mon Mar 20 11:55:29 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 20 Mar 2023 11:55:29 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter In-Reply-To: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Thu, 9 Mar 2023 18:28:02 GMT, Roberto Casta?eda Lozano wrote: > The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: > > - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and > - "Condense graph", which makes the graph more compact without loss of information. > > Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): > > ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) > > Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: > - combining Bool and conversion nodes into their predecessors, > - inlining all Parm nodes except control into their successors (this removes lots of long edges), > - removing "top" inputs from call-like nodes, > - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, > - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and > - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). > > The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. 
This helps reducing node width, which is known to improve overall layout quality: > > ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) > > Note that the exact input indices can still be retrieved via the incoming edge's tooltips: > > ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) > > The control-flow graph view is also adapted to this representation: > > ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) > > #### Additional improvements > > Additionally, this changeset: > - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), > - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, > - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and > - defines and documents JavaScript helpers to simplify the new and existing available filters. > > Here is an example of the effect of the new "Show custom node info" filter: > > ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) > > ### Testing > > #### Functionality > > - Tested the functionality manually on a small selection of graphs. > > - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > #### Performance > > Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. > > The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). Thanks @robcasloz for working on this. Extending the filter in the way you did it is very useful in my opinion! I tested your changes and everything seems to work as expected. 
(See my comments in the code for more) src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/CombineFilter.java line 71: > 69: } > 70: } > 71: I think `assert slot != null;` should be moved up here src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/FilterChain.java line 1: > 1: /* I think `applyInOrder` can be simplified as this : public void applyInOrder(Diagram d, FilterChain sequence) { for (Filter f : sequence.getFilters()) { if (filters.contains(f)) { f.apply(d); } } } Reason: `FilterChain ordering` is the same as `this` in `FilterChain`. Usually `filters` are already in the order that we want them to apply. Only exception is when the user manually reoders the filters. `FilterChain sequence` contains all the filters in the order that they appear in the list. `filters` are the filters that are selected by the user and should alway be a subset of `sequence`. Therefore we can just iterate through `sequence` to get the correct order and apply each filter that is selected (contained in `filters`) src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/Figure.java line 343: > 341: inputLabel = nodeTinyLabel; > 342: } > 343: if (inputLabel != null) { according to my IDE inputLabel is here always non-null. src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/InputSlot.java line 76: > 74: int gapAmount = (int)((getPosition() + 1)*gapRatio); > 75: return new Point(gapAmount + Figure.getSlotsWidth(Figure.getAllBefore(getFigure().getInputSlots(), this)) + getWidth()/2, -Figure.SLOT_START); > 76: //return new Point((getFigure().getWidth() / (getFigure().getInputSlots().size() * 2)) * (getPosition() * 2 + 1), -Figure.SLOT_START); perhaps remove this old comment src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramScene.java line 218: > 216: if (ids.contains(figure.getInputNode().getId())) { > 217: selectedFigures.add(figure); > 218: } Suggestion: } for (Slot slot : figure.getSlots()) { if (!Collections.disjoint(slot.getSource().getSourceNodesAsSet(), ids)) { highlightedObjects.add(slot); } } I am not sure what your intent was in adding the slots to the selected objects. If you wanted the slots to be selected globally in "link global node selection" mode, you need to add the following code to make it work ------------- Changes requested by tholenstein (Committer). PR: https://git.openjdk.org/jdk/pull/12955 From tholenstein at openjdk.org Mon Mar 20 11:55:31 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 20 Mar 2023 11:55:31 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Mon, 20 Mar 2023 11:47:37 GMT, Tobias Holenstein wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information. 
>> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). >> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset: >> - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), >> - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, >> - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and >> - defines and documents JavaScript helpers to simplify the new and existing available filters. >> >> Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. >> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. 
Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. >> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramScene.java line 218: > >> 216: if (ids.contains(figure.getInputNode().getId())) { >> 217: selectedFigures.add(figure); >> 218: } > > Suggestion: > > } > for (Slot slot : figure.getSlots()) { > if (!Collections.disjoint(slot.getSource().getSourceNodesAsSet(), ids)) { > highlightedObjects.add(slot); > } > } > > I am not sure what your intent was in adding the slots to the selected objects. If you wanted the slots to be selected globally in "link global node selection" mode, you need to add the following code to make it work Even if this was not you intention, I think selecting the slots globally is a useful feature. ------------- PR: https://git.openjdk.org/jdk/pull/12955 From roland at openjdk.org Mon Mar 20 12:05:46 2023 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 20 Mar 2023 12:05:46 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops [v2] In-Reply-To: <0xHmGI1abpLB-_9DvlR9M8G43XZaAx8Y_LeSuRTLFxE=.6aa6a8be-ed9f-4666-9c25-af07f76e180c@github.com> References: <0xHmGI1abpLB-_9DvlR9M8G43XZaAx8Y_LeSuRTLFxE=.6aa6a8be-ed9f-4666-9c25-af07f76e180c@github.com> Message-ID: On Thu, 16 Mar 2023 08:52:49 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: >> >> - NULL -> nullptr >> - Merge branch 'master' into JDK-8300257 >> - fix & test > > src/hotspot/share/opto/superword.cpp line 4318: > >> 4316: if (opc == Op_AddI) { >> 4317: if (n->in(2)->is_Con() && invariant(n->in(1))) { >> 4318: maybe_add_to_invar(maybe_negate_invar(negate, n->in(1))); > > It feels like `maybe_negate_invar` should be moved into `maybe_add_to_invar` and be controlled by a `negate` argument. Thanks for the review. I applied your suggestions and made that change too. ------------- PR: https://git.openjdk.org/jdk/pull/12942 From xlinzheng at openjdk.org Mon Mar 20 12:08:18 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Mon, 20 Mar 2023 12:08:18 GMT Subject: RFR: 8304387: Fix positions of shared static stubs / trampolines [v2] In-Reply-To: <5XQDOtLl4vpWkq6U_mTl1X4E6XwrNReLrY2mhcuo5WY=.3e7be4db-5cfb-4f43-acae-7dcaf97632fb@github.com> References: <5XQDOtLl4vpWkq6U_mTl1X4E6XwrNReLrY2mhcuo5WY=.3e7be4db-5cfb-4f43-acae-7dcaf97632fb@github.com> Message-ID: On Mon, 20 Mar 2023 10:31:56 GMT, Andrew Dinn wrote: >> Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: >> >> Andrew's review comments: another cleanup > > src/hotspot/share/asm/codeBuffer.cpp line 1012: > >> 1010: } >> 1011: >> 1012: bool CodeBuffer::finalize_stubs() { > > It's not actually relevant to this fix but it might be worth making a small change here. There is actually no need to call `pd_finalize_stubs` if `_finalize_stubs` is already false. So, you could return straight away if it is already false. 
Thanks for the quick review, Andrew. Yes you are right. Tested a simple AArch64 tier1 and some tests for zero. For backends excluding x86, AArch64 and RISC-V, `CodeBuffer::supports_shared_stubs()` are all false and there are no shared trampolines implemented, so `_finalize_stubs` is always a false value. ------------- PR: https://git.openjdk.org/jdk/pull/13071 From xlinzheng at openjdk.org Mon Mar 20 12:08:16 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Mon, 20 Mar 2023 12:08:16 GMT Subject: RFR: 8304387: Fix positions of shared static stubs / trampolines [v2] In-Reply-To: References: Message-ID: <1-rKX7bks0rCi-7k-IHi07-Abg53c8EAt9KCYjgv66E=.14b278ee-9e72-4ac3-abd2-d7ba4b71f397@github.com> > This RFE fixes the positions of static stubs / trampolines. They should be like: > > > [Verified Entry Point] > ... > ... > ... > [Stub Code] > > [Exception Handler] > ... > ... > [Deopt Handler Code] > ... > ... > > > Currently after we have the shared static stubs/trampolines in JDK-8280481 and JDK-8280152 : > > > [Verified Entry Point] > ... > ... > ... > [Stub Code] > > [Exception Handler] > ... > ... > [Deopt Handler Code] > ... > ... > // they are presented in the Deopt range, though do not have correctness issues. > > > For example on x86: > > > [Verified Entry Point] > ... > [Stub Code] > 0x00007fac68ef4908: nopl 0x0(%rax,%rax,1) ; {no_reloc} > 0x00007fac68ef490d: mov $0x0,%rbx ; {static_stub} > 0x00007fac68ef4917: jmpq 0x00007fac68ef4917 ; {runtime_call} > 0x00007fac68ef491c: nop > 0x00007fac68ef491d: mov $0x0,%rbx ; {static_stub} > 0x00007fac68ef4927: jmpq 0x00007fac68ef4927 ; {runtime_call} > [Exception Handler] > 0x00007fac68ef492c: callq 0x00007fac703da280 ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x00007fac68ef4931: mov $0x7fac885d8067,%rdi ; {external_word} > 0x00007fac68ef493b: and $0xfffffffffffffff0,%rsp > 0x00007fac68ef493f: callq 0x00007fac881e9900 ; {runtime_call MacroAssembler::debug64(char*, long, long*)} > 0x00007fac68ef4944: hlt > [Deopt Handler Code] > 0x00007fac68ef4945: mov $0x7fac68ef4945,%r10 ; {section_word} > 0x00007fac68ef494f: push %r10 > 0x00007fac68ef4951: jmpq 0x00007fac70326520 ; {runtime_call DeoptimizationBlob} > 0x00007fac68ef4956: mov $0x0,%rbx ; {static_stub} // <---------- here > 0x00007fac68ef4960: jmpq 0x00007fac68ef4960 ; {runtime_call} > 0x00007fac68ef4965: mov $0x0,%rbx ; {static_stub} // <---------- here > 0x00007fac68ef496f: jmpq 0x00007fac68ef496f ; {runtime_call} > 0x00007fac68ef4974: mov $0x0,%rbx ; {static_stub} // <---------- here > 0x00007fac68ef497e: jmpq 0x00007fac68ef497e ; {runtime_call} > 0x00007fac68ef4983: hlt > 0x00007fac68ef4984: hlt > 0x00007fac68ef4985: hlt > 0x00007fac68ef4986: hlt > 0x00007fac68ef4987: hlt > -------------------------------------------------------------------------------- > [/Disassembly] > > > > It can be simply reproduced and dumped by `-XX:+PrintAssembly`. > > Though the correctness doesn't get affected in the current case, we may need to move them to a better place, back into the `[Stub Code]`, which might be more reasonable and unified. Also for the performance's sake, `ciEnv::register_method()`, where `code_buffer->finalize_stubs()` locates currently, has two locks `Compile_lock` and `MethodCompileQueue_lock`. So I think it may be better to move `code_buffer->finalize_stubs()` out to C1 and C2 code generation phases, separately, before the exception handler code is emitted so they are inside the `[Stub Code]` range. 
> > BTW, this is the "direct cause" of [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) because shared trampolines and their data are generated at the end of compiled code, which is different from the original condition. Though for that issue, the root cause is still from the Binutils, for even if trampolines are generated at the end of code, we should not fail as well when disassembling. But that is another issue, please see [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) for more details. > > Tested x86, AArch64, and RISC-V hotspot tier1~4 with fastdebug build, no new errors found. > > Thanks, > Xiaolin Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: Andrew's review comments: another cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13071/files - new: https://git.openjdk.org/jdk/pull/13071/files/c3262f74..0c39f182 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13071&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13071&range=00-01 Stats: 14 lines in 5 files changed: 1 ins; 8 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/13071.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13071/head:pull/13071 PR: https://git.openjdk.org/jdk/pull/13071 From tholenstein at openjdk.org Mon Mar 20 12:13:34 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 20 Mar 2023 12:13:34 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter In-Reply-To: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: <7-AdNFFLocni-2nnwHPVf8fy1ykMXn7hrP8dg1Z-0po=.1f02eac6-0213-495d-9193-645eb32f099d@github.com> On Thu, 9 Mar 2023 18:28:02 GMT, Roberto Casta?eda Lozano wrote: > The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: > > - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and > - "Condense graph", which makes the graph more compact without loss of information. > > Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): > > ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) > > Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: > - combining Bool and conversion nodes into their predecessors, > - inlining all Parm nodes except control into their successors (this removes lots of long edges), > - removing "top" inputs from call-like nodes, > - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, > - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and > - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). > > The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. 
This helps reducing node width, which is known to improve overall layout quality: > > ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) > > Note that the exact input indices can still be retrieved via the incoming edge's tooltips: > > ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) > > The control-flow graph view is also adapted to this representation: > > ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) > > #### Additional improvements > > Additionally, this changeset: > - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), > - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, > - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and > - defines and documents JavaScript helpers to simplify the new and existing available filters. > > Here is an example of the effect of the new "Show custom node info" filter: > > ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) > > ### Testing > > #### Functionality > > - Tested the functionality manually on a small selection of graphs. > > - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > #### Performance > > Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. > > The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > I'll also have a look at the code later but I'm not very familiar with it. > First of all, nice work! The improved filters and the additional node info for call and exception-creation nodes is very useful. Maybe the node info can be improved further in a future RFE, for example for `CountedLoop` nodes to also show if it is a pre/main/post loop or to add the stride. > > I've tried your patch out and it works quite well. 
But I've noticed some things: > > * When selecting a `CallStaticJava` node, the custom node info is sometimes cut depending on the zoom level (sometimes more, sometimes less): > ![image](https://user-images.githubusercontent.com/17833009/226298043-a029d42f-0fe0-423e-853c-027bba3b10a8.png) > * Selecting an inlined node with the condensed graph filter does not work when searching for it. For example, I can search for `165 Bool` node in the search field. It finds it but when clicking on it, it shows me an empty graph. I would have expected to see the following graph with the "outer" node being selected which includes `165 Bool`: > ![image](https://user-images.githubusercontent.com/17833009/226300286-375b13f4-01bf-4f7d-8984-fd17031b43ed.png) > * I've only just noticed this now: When having IGV opened on my second, larger monitor (ultra-wide, 3440x1440), the tooltip is quite off: > ![image](https://user-images.githubusercontent.com/17833009/226302714-abd098cf-104a-4cd0-9d9a-065f1fb08b96.png) > But this was already a problem before and unrelated to your patch. > > I'll also have a look at the code later but I'm not very familiar with it. > > Thanks, Christian Hi Christian, Regarding tooltips on large monitors, I think this is an issue of `Netbeans Platform` that we are using in IGV. Perhaps we can upgrade to a newer version of `Netbeans Platform` ------------- PR: https://git.openjdk.org/jdk/pull/12955 From chagedorn at openjdk.org Mon Mar 20 12:30:28 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 20 Mar 2023 12:30:28 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter In-Reply-To: <7-AdNFFLocni-2nnwHPVf8fy1ykMXn7hrP8dg1Z-0po=.1f02eac6-0213-495d-9193-645eb32f099d@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> <7-AdNFFLocni-2nnwHPVf8fy1ykMXn7hrP8dg1Z-0po=.1f02eac6-0213-495d-9193-645eb32f099d@github.com> Message-ID: <9JHT4fvwluc3EULCyjSchk9DUcoz3v6eQp1lZ7Ca0TI=.64b1968a-f894-49da-8532-a2f01e7ce911@github.com> On Mon, 20 Mar 2023 12:10:35 GMT, Tobias Holenstein wrote: > > I'll also have a look at the code later but I'm not very familiar with it. > > > First of all, nice work! The improved filters and the additional node info for call and exception-creation nodes is very useful. Maybe the node info can be improved further in a future RFE, for example for `CountedLoop` nodes to also show if it is a pre/main/post loop or to add the stride. > > I've tried your patch out and it works quite well. But I've noticed some things: > > > > * When selecting a `CallStaticJava` node, the custom node info is sometimes cut depending on the zoom level (sometimes more, sometimes less): > > ![image](https://user-images.githubusercontent.com/17833009/226298043-a029d42f-0fe0-423e-853c-027bba3b10a8.png) > > * Selecting an inlined node with the condensed graph filter does not work when searching for it. For example, I can search for `165 Bool` node in the search field. It finds it but when clicking on it, it shows me an empty graph. 
I would have expected to see the following graph with the "outer" node being selected which includes `165 Bool`: > > ![image](https://user-images.githubusercontent.com/17833009/226300286-375b13f4-01bf-4f7d-8984-fd17031b43ed.png) > > * I've only just noticed this now: When having IGV opened on my second, larger monitor (ultra-wide, 3440x1440), the tooltip is quite off: > > ![image](https://user-images.githubusercontent.com/17833009/226302714-abd098cf-104a-4cd0-9d9a-065f1fb08b96.png) > > But this was already a problem before and unrelated to your patch. > > > > I'll also have a look at the code later but I'm not very familiar with it. > > Thanks, Christian > > Hi Christian, Regarding tooltips on large monitors, I think this is an issue of `Netbeans Platform` that we are using in IGV. Perhaps we can upgrade to a newer version of `Netbeans Platform` Thanks for your answer Toby! I see, that might help to fix this issue. But it's just a detail that I've noticed now when trying out the tooltips for the new inlined nodes. ------------- PR: https://git.openjdk.org/jdk/pull/12955 From wanghaomin at openjdk.org Mon Mar 20 13:01:41 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Mon, 20 Mar 2023 13:01:41 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: <01AxKBXWxHUxCSIZhNJKwuLcg2KVSx9Jva1PTBL5ZxQ=.42c59b6c-efad-4558-8a7d-190e4505049a@github.com> References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> <01AxKBXWxHUxCSIZhNJKwuLcg2KVSx9Jva1PTBL5ZxQ=.42c59b6c-efad-4558-8a7d-190e4505049a@github.com> Message-ID: On Mon, 20 Mar 2023 08:26:14 GMT, Tobias Hartmann wrote: >> Wang Haomin has updated the pull request incrementally with one additional commit since the last revision: >> >> compare the results with 0 > > src/hotspot/share/adlc/output_c.cpp line 3989: > >> 3987: if (inst->captures_bottom_type(_globalNames)) { >> 3988: if (strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall")) != 0 >> 3989: && strncmp("MachIf", inst->mach_base_class(_globalNames), strlen("MachIf")) != 0) { > > Could you please explain this change, and how it relates to JDK-8292289, in more detail? After adding `match(If cop (VectorTest op1 op2));` to the ad file, the following code is generated in `MachNode *State::MachNodeGenerator(int opcode)` of ad_xxx_gen.cpp. 5458 case anytrue_in_maskV16_branch_rule: { 5459 anytrue_in_maskV16_branchNode *node = new anytrue_in_maskV16_branchNode(); 5460 node->set_opnd_array(4, MachOperGenerator(LABEL)); 5461 node->_bottom_type = _leaf->bottom_type(); 5462 node->_prob = _leaf->as_If()->_prob; 5463 node->_fcnt = _leaf->as_If()->_fcnt; 5464 return node; 5465 } The error `'class anytrue_in_maskV16_branchNode' has no member named '_bottom_type'` is reported when running `make hotspot`. In fact, `MachIfNode` does not have the member `_bottom_type`. A normal `MachIfNode` returns false from `inst->captures_bottom_type`. However, the right child node of this `MachIfNode` is a `VectorTest`, so `is_vector` returns true and `inst->captures_bottom_type` returns true as well. That is why the error occurred.
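For illustration, the guarded emission in adlc's output_c.cpp then looks roughly like the sketch below; the strncmp conditions are taken from the diff quoted above, while the fprintf body is an assumption rather than the exact upstream code:

```
// Sketch: only emit the bottom-type capture for MachNode subclasses that
// actually declare a _bottom_type field, i.e. skip the MachCall and MachIf
// families (the latter is what the new strncmp("MachIf", ...) check adds).
if (inst->captures_bottom_type(_globalNames)) {
  if (strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall")) != 0
      && strncmp("MachIf", inst->mach_base_class(_globalNames), strlen("MachIf")) != 0) {
    fprintf(fp_cpp, "  node->_bottom_type = _leaf->bottom_type();\n");
  }
}
```

With that guard in place, the generated case for `anytrue_in_maskV16_branch_rule` no longer contains the `node->_bottom_type = _leaf->bottom_type();` assignment, so the build error above disappears.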
------------- PR: https://git.openjdk.org/jdk/pull/12917 From thartmann at openjdk.org Mon Mar 20 13:19:31 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Mar 2023 13:19:31 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops [v4] In-Reply-To: <1qWMOwI1VZiAf-DY2tRw8LqFe93CS7omKZam2Ro5fJ8=.a3c51e8d-c2b5-4d89-bfdb-3ed20beaa9d6@github.com> References: <1qWMOwI1VZiAf-DY2tRw8LqFe93CS7omKZam2Ro5fJ8=.a3c51e8d-c2b5-4d89-bfdb-3ed20beaa9d6@github.com> Message-ID: On Mon, 20 Mar 2023 07:37:18 GMT, Roland Westrelin wrote: >> In the test case `testByteLong1` (that's extracted from a memory >> segment micro benchmark), the address of the store is initially: >> >> >> (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) >> >> >> (#numbers are node numbers to help the discussion). >> >> `iv#101` is the `Phi` of a counted loop. `invar#163` is the >> `baseOffset` load. >> >> To eliminate the range check, the loop is transformed into a loop nest >> and as a consequence the address above becomes: >> >> >> (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) >> >> >> `invar#308` is some expression from a `Phi` of the outer loop. >> >> That `AddP` is transformed multiple times to push the invariants out of loop: >> >> >> (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) >> >> >> then: >> >> >> (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) >> >> >> and finally: >> >> >> (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) >> >> >> `AddP#855` is out of the inner loop. >> >> This doesn't vectorize because: >> >> - there are 2 invariants in the address expression but superword only >> support one (tracked by `_invar` in `SWPointer`) >> >> - there are more levels of `AddP` (4) than superword supports (3) >> >> To fix that, I propose to no longer track the address elements in >> `_invar`, `_negate_invar` and `_invar_scale` but instead to have a >> single `_invar` which is an expression built by superword as it >> follows chains of `addP` nodes. I kept the previous `_invar`, >> `_negate_invar` and `_invar_scale` as debugging and use them to check >> that what vectorized with the previous scheme still does. >> >> I also propose lifting the restriction on 3 levels of `AddP` entirely. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - review > - Merge branch 'master' into JDK-8300257 > - Update test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java > > Co-authored-by: Tobias Hartmann > - Update test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/superword.hpp > > Co-authored-by: Tobias Hartmann > - NULL -> nullptr > - Merge branch 'master' into JDK-8300257 > - fix & test Thanks for making these changes. `SWPointer::maybe_negate_invar` could now be removed as it has only one user but I'm also fine with leaving it as is. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/12942 From epeter at openjdk.org Mon Mar 20 13:48:05 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 20 Mar 2023 13:48:05 GMT Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible In-Reply-To: References: Message-ID: <2nglRCq5IVQqRb8TUWqTElepM5D8aoQ9raRf-pyVXHs=.2b3870ed-d642-4fd3-92b1-4c36f7da1128@github.com> On Fri, 17 Mar 2023 15:55:02 GMT, Vladimir Kozlov wrote: >> I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand. >> >> Some of them seem to be justified. There I added comments why they are justified. They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter. >> >> For the following bailouts I did not add an assert, because it may have revealed a bug: >> [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA >> >> Note: >> [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj >> That bug bug was the reason for this RFE here. I added the assert for "malformed control flow". After this RFE here, that Bug will run into the assert on debug builds. >> >> I ran `tier1-6` and stress testing. Now running `tier7-9`. >> >> Filed a follow-up RFE to do the same for `BAILOUT` in `C1`: [JDK-8304532](https://bugs.openjdk.org/browse/JDK-8304532). > > src/hotspot/share/opto/compile.cpp line 757: > >> 755: if (cg == nullptr) { >> 756: const char* reason = InlineTree::check_can_parse(method()); >> 757: assert(reason != nullptr, "cannot parse method: why?"); > > Add `reason` to assert's message. `reason` is `nullptr` when the asser fails. I now say `expect reason for parse failure`. ------------- PR: https://git.openjdk.org/jdk/pull/13038 From epeter at openjdk.org Mon Mar 20 13:58:52 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 20 Mar 2023 13:58:52 GMT Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible [v2] In-Reply-To: References: Message-ID: > I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand. > > Some of them seem to be justified. There I added comments why they are justified. They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter. > > For the following bailouts I did not add an assert, because it may have revealed a bug: > [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA > > Note: > [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj > That bug bug was the reason for this RFE here. I added the assert for "malformed control flow". After this RFE here, that Bug will run into the assert on debug builds. > > I ran `tier1-6` and stress testing. Now running `tier7-9`. > > Filed a follow-up RFE to do the same for `BAILOUT` in `C1`: [JDK-8304532](https://bugs.openjdk.org/browse/JDK-8304532). 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: addressing Vladimir K's review suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13038/files - new: https://git.openjdk.org/jdk/pull/13038/files/07b00ffd..28d41ffe Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13038&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13038&range=00-01 Stats: 20 lines in 4 files changed: 12 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/13038.diff Fetch: git fetch https://git.openjdk.org/jdk pull/13038/head:pull/13038 PR: https://git.openjdk.org/jdk/pull/13038 From cslucas at openjdk.org Mon Mar 20 15:29:42 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 20 Mar 2023 15:29:42 GMT Subject: RFR: 8303970: C2 can not merge homogeneous adjacent two If [v3] In-Reply-To: References: <6VubTXyFbgqR-itTDABGv9CWVWsym52M3snRNxOotoc=.a91c0bc7-13f5-48c6-b13e-ad8bee448366@github.com> Message-ID: <4DR7m-Js6OtLku_39uDGsVF_IwkRbks0EhlB2BKa9Jk=.a0ad98ed-2bbf-4174-b449-22b35381b35a@github.com> On Mon, 20 Mar 2023 02:46:51 GMT, Yi Yang wrote: >> src/hotspot/share/opto/subnode.cpp line 1488: >> >>> 1486: >>> 1487: static bool is_arithmetic_cmp(Node* cmp) { >>> 1488: if (!cmp->is_Cmp()) { >> >> Perhaps just merge this if with the next one. > > Hi Cesar, I receive feedback from Roland in #13039 that suggested extending identical_backtoback_ifs instead of adding Ideal for BoolNode and CmpNode. > > Before making any modifications based on your review comment, I would first like to reach a consensus on whether we should add idealization for bool and cmp or extend identical_backtoback_ifs. Do you have any comments in this regard? Perhaps @TobiHartmann and @merykitty could also chime in. I was unaware of the other PR. Please ignore my comments if the other PR covers this work. ------------- PR: https://git.openjdk.org/jdk/pull/12978 From kvn at openjdk.org Mon Mar 20 16:39:39 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 20 Mar 2023 16:39:39 GMT Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible [v2] In-Reply-To: References: Message-ID: On Mon, 20 Mar 2023 13:58:52 GMT, Emanuel Peter wrote: >> I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand. >> >> Some of them seem to be justified. There I added comments why they are justified. They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter. >> >> For the following bailouts I did not add an assert, because it may have revealed a bug: >> [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA >> >> Note: >> [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj >> That bug bug was the reason for this RFE here. I added the assert for "malformed control flow". After this RFE here, that Bug will run into the assert on debug builds. >> >> I ran `tier1-6` and stress testing. Now running `tier7-9`. >> >> Filed a follow-up RFE to do the same for `BAILOUT` in `C1`: [JDK-8304532](https://bugs.openjdk.org/browse/JDK-8304532). > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > addressing Vladimir K's review suggestions Looks good. 
------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/13038 From duke at openjdk.org Mon Mar 20 17:14:01 2023 From: duke at openjdk.org (Jasmine Karthikeyan) Date: Mon, 20 Mar 2023 17:14:01 GMT Subject: RFR: 8304230: LShift ideal transform assertion In-Reply-To: References: Message-ID: On Mon, 20 Mar 2023 07:47:20 GMT, Tobias Hartmann wrote: > It might be an intermittent state during GVN, where a non-constant node is known to have a constant type that's returned by `const_shift_count` but the node was not yet replaced by an actual constant and therefore `==` returns false. Interesting, thank you for the explanation! I will keep this in mind for the future. Thank you for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13049#issuecomment-1476626783 From vladimir.kempik at gmail.com Mon Mar 20 19:02:38 2023 From: vladimir.kempik at gmail.com (Vladimir Kempik) Date: Mon, 20 Mar 2023 22:02:38 +0300 Subject: Missaligned memory accesses from JDK In-Reply-To: <6194F148-F760-407F-961E-180BBDC6AE4F@gmail.com> References: <29875E09-8B1E-4255-AAED-06305459C872@gmail.com> <330a2677.26fa2.186fdec7098.Coremail.yangfei@iscas.ac.cn> <6194F148-F760-407F-961E-180BBDC6AE4F@gmail.com> Message-ID: <7F6F8F09-5521-4F6E-B4C4-DB3EFC45EF6B@gmail.com> Adding hs compiler list. Could you please suggest on best way to make emit_intX methods not perform misaligned memory stores ? Talking about src/hotspot/share/asm/codeBuffer.hpp from https://github.com/VladimirKempik/jdk/commit/18d7f399ce1bc213b2495411193938d914d3f616#diff-deb8ab083311ba60c0016dc34d6518579bbee4683c81e8d348982bac897fe8ae Regards, Vladimir >> For each emit_intX functions modified, I see there is a correspondent version which handles unaligned access. For example, 'void emit_int16(uint8_t x1, uint8_t x2)' for 'void emit_int16(uint16_t x)' >> So if we encounter an unaligned access issue when using 'emit_int16(uint16_t x)', shouldn't we change the callsite to use 'emit_int16(uint8_t x1, uint8_t x2)' instead? > > Hello > not exactly > 'void emit_int16(uint8_t x1, uint8_t x2) > will always use slow version ( store byte) > > but > void emit_int16(uint16_t x) > will use slow version only on unaligned stores. if store is aligned, it will use "store half", which should be faster. > > So we can't always use emit_int16(uint8_t x1, uint8_t x2) at callsite. > > and we can't decide which one to use at callsite as callsite should be unaware of end() value inside CodeSection class > > Regards, Vladimir From cslucas at openjdk.org Mon Mar 20 19:23:34 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 20 Mar 2023 19:23:34 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> > Can I please get reviews for this PR? > > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested.
> > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. > > The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. > > The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. > > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. I also run tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Add support for SR'ing some inputs of merges used for field loads ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12897/files - new: https://git.openjdk.org/jdk/pull/12897/files/3b492d2e..a158ae66 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=02-03 Stats: 481 lines in 9 files changed: 292 ins; 117 del; 72 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From jcking at openjdk.org Mon Mar 20 20:56:41 2023 From: jcking at openjdk.org (Justin King) Date: Mon, 20 Mar 2023 20:56:41 GMT Subject: RFR: JDK-8304546: CompileTask::_directive leaked if CompileBroker::invoke_compiler_on_method not called Message-ID: Ensure `CompileTask::_directive` is not leaked when `CompileBroker::invoke_compiler_on_method` is not called. This can happen for stale tasks or when compilation is disabled. 
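A minimal sketch of the cleanup the title describes, purely illustrative — the accessor name and the exact call sites are assumptions, not the actual patch:

```
// Hypothetical helper: release the directive held by a CompileTask that is
// dropped without ever reaching invoke_compiler_on_method (a stale task, or
// compilation disabled), so the DirectiveSet reference taken when the task
// was created is not leaked.
static void release_task_directive(CompileTask* task) {
  DirectiveSet* directive = task->directive();  // assumed accessor for CompileTask::_directive
  if (directive != nullptr) {
    DirectivesStack::release(directive);        // existing refcount-release API in compilerDirectives
  }
}
```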
------------- Commit messages: - Ensure CompileTask::_directive is not leaked Changes: https://git.openjdk.org/jdk/pull/13108/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13108&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304546 Stats: 6 lines in 3 files changed: 6 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13108/head:pull/13108 PR: https://git.openjdk.org/jdk/pull/13108 From eliu at openjdk.org Tue Mar 21 01:54:39 2023 From: eliu at openjdk.org (Eric Liu) Date: Tue, 21 Mar 2023 01:54:39 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB In-Reply-To: References: Message-ID: On Mon, 20 Mar 2023 10:24:10 GMT, Tobias Hartmann wrote: > This triggers failures in testing: > > ``` > jdk/incubator/vector/Byte64VectorTests.java > > java.lang.Exception: failures: 1 > at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:95) > at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:53) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) > at java.base/java.lang.reflect.Method.invoke(Method.java:578) > at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:125) > at java.base/java.lang.Thread.run(Thread.java:1623) > ``` I think this should be caused by https://bugs.openjdk.org/browse/JDK-8303508. The removed narrowing exposed this issue on x86, that the generated code of ExtractB perhaps does not handle sign-bit correctly. TBH I'm not familiar with x86 instructions... @jatin-bhateja Could you help to take a look at this bug? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13070#issuecomment-1477172638 From jcking at google.com Tue Mar 21 02:36:29 2023 From: jcking at google.com (Justin King) Date: Mon, 20 Mar 2023 19:36:29 -0700 Subject: Missaligned memory accesses from JDK In-Reply-To: <7F6F8F09-5521-4F6E-B4C4-DB3EFC45EF6B@gmail.com> References: <29875E09-8B1E-4255-AAED-06305459C872@gmail.com> <330a2677.26fa2.186fdec7098.Coremail.yangfei@iscas.ac.cn> <6194F148-F760-407F-961E-180BBDC6AE4F@gmail.com> <7F6F8F09-5521-4F6E-B4C4-DB3EFC45EF6B@gmail.com> Message-ID: https://github.com/openjdk/jdk/pull/12078 Proposes introducing a standard interface for dealing with unaligned loads/stores without undefined behavior. On Mon, Mar 20, 2023, 12:03 PM Vladimir Kempik wrote: > Adding hs compiler list. > > Could you please suggest on best way to make emit_intX methods not perform misaligned memory stores ? > Talking about src/hotspot/share/asm/codeBuffer.hpp from https://github.com/VladimirKempik/jdk/commit/18d7f399ce1bc213b2495411193938d914d3f616#diff-deb8ab083311ba60c0016dc34d6518579bbee4683c81e8d348982bac897fe8ae > > Regards, Vladimir > > > For each emit_intX functions modified, I see there is a correspondent > version which handles unaligned access. For example, 'void > emit_int16(uint8_t x1, uint8_t x2)' for 'void emit_int16(uint16_t x)' > So if we encounter an unaligned access issue when using > 'emit_int16(uint16_t x)', shouldn't we change the callsite to use > 'emit_int16(uint8_t x1, uint8_t x2)' instead? > > > Hello > not exactly > 'void emit_int16(uint8_t x1, uint8_t x2) > will always use slow version ( store byte) > > but > void emit_int16(uint16_t x) > will use slow version only on unaligned stores. if store is aligned, it > will use "store half", which should be faster. > > So we can?t always use emit_int16(uint8_t x1, uint8_t x2) at callsite. 
> > and we can't decide which one to use at callsite as callsite should be > unaware of end() value inside CodeSection class > > Regards, Vladimir > > > From qamai at openjdk.org Tue Mar 21 02:38:40 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 21 Mar 2023 02:38:40 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB In-Reply-To: References: Message-ID: <1xefg5Er866JRwRc53ioKjESRndo31dwTfM2oitZKQY=.1432e570-ce09-4a7a-bbbc-8deae75411cb@github.com> On Fri, 17 Mar 2023 06:14:00 GMT, Eric Liu wrote: > This is a trivial patch, which fixes the bottom type of ExtractB/UB nodes. > > ExtractNode can be generated by Vector API Vector.lane(int), which gets the lane element at the given index. A more precise type of range can help to optimize out unnecessary type conversion in some cases. > > Below shows a typical case used ExtractBNode > > > public static byte byteLt16() { > ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); > return vecb.lane(1); > } > > > In this case, c2 constructs IR graph like: > > ExtractB ConI(24) > | __| > | / | > LShiftI __| > | / > RShiftI > > which generates AArch64 code: > > movi v16.16b, #0x1 > smov x11, v16.b[1] > sxtb w0, w11 > > with this patch, this shift pair can be optimized out by RShiftI's identity [1]. The code is optimized to: > > movi v16.16b, #0x1 > smov x0, v16.b[1] > > [TEST] > > Full jtreg passed except 4 files on x86: > > jdk/incubator/vector/Byte128VectorTests.java > jdk/incubator/vector/Byte256VectorTests.java > jdk/incubator/vector/Byte512VectorTests.java > jdk/incubator/vector/Byte64VectorTests.java > > They are caused by a known issue on x86 [2]. > > [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 > [2] https://bugs.openjdk.org/browse/JDK-8303508 Yes x86 does not handle signed extension correctly. `pextrb` and `pextrw` zeroes the upper bits instead of signed extending them. A simple fix is to add `movsx` after those. https://github.com/openjdk/jdk/blob/bbca7c3ede338a04d140abfe3e19cb27c628a0f5/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L2247 ------------- PR Comment: https://git.openjdk.org/jdk/pull/13070#issuecomment-1477200165 From duke at openjdk.org Tue Mar 21 06:05:53 2023 From: duke at openjdk.org (Jasmine Karthikeyan) Date: Tue, 21 Mar 2023 06:05:53 GMT Subject: Integrated: 8304230: LShift ideal transform assertion In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 18:24:23 GMT, Jasmine Karthikeyan wrote: > Hi, > This PR aims to address the assertion on arm32 where the special-case `add1->in(2) == in(2)` check fails, and it falls through to the regular cases. I'm not quite sure how this issue can manifest as AFAIK the GVN should allow the usage of `==` to check against constants that are equal. I've changed the check from node equality to constant equality to hopefully resolve this. I unfortunately cannot reproduce the behavior on x86, nor do I have access to arm32 hardware, so I would greatly appreciate reviews and help testing this change (cc @bulasevich). Thank you all in advance. This pull request has now been integrated.
Changeset: a6b72f56 Author: Jasmine K <25208576+SuperCoder7979 at users.noreply.github.com> Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/a6b72f56f56b4f33ac163e90b115d79b2b844999 Stats: 14 lines in 1 file changed: 6 ins; 6 del; 2 mod 8304230: LShift ideal transform assertion Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13049 From fyang at openjdk.org Tue Mar 21 06:55:42 2023 From: fyang at openjdk.org (Fei Yang) Date: Tue, 21 Mar 2023 06:55:42 GMT Subject: RFR: 8304387: Fix positions of shared static stubs / trampolines [v2] In-Reply-To: <1-rKX7bks0rCi-7k-IHi07-Abg53c8EAt9KCYjgv66E=.14b278ee-9e72-4ac3-abd2-d7ba4b71f397@github.com> References: <1-rKX7bks0rCi-7k-IHi07-Abg53c8EAt9KCYjgv66E=.14b278ee-9e72-4ac3-abd2-d7ba4b71f397@github.com> Message-ID: On Mon, 20 Mar 2023 12:08:16 GMT, Xiaolin Zheng wrote: >> This RFE fixes the positions of shared static stubs / trampolines. They should be like: >> >> >> [Verified Entry Point] >> ... >> ... >> ... >> [Stub Code] >> >> [Exception Handler] >> ... >> ... >> [Deopt Handler Code] >> ... >> ... >> >> >> Currently after we have the shared static stubs/trampolines in JDK-8280481 and JDK-8280152 : >> >> >> [Verified Entry Point] >> ... >> ... >> ... >> [Stub Code] >> >> [Exception Handler] >> ... >> ... >> [Deopt Handler Code] >> ... >> ... >> // they are presented in the Deopt range, though do not have correctness issues. >> >> >> For example on x86: >> >> >> [Verified Entry Point] >> ... >> [Stub Code] >> 0x00007fac68ef4908: nopl 0x0(%rax,%rax,1) ; {no_reloc} >> 0x00007fac68ef490d: mov $0x0,%rbx ; {static_stub} >> 0x00007fac68ef4917: jmpq 0x00007fac68ef4917 ; {runtime_call} >> 0x00007fac68ef491c: nop >> 0x00007fac68ef491d: mov $0x0,%rbx ; {static_stub} >> 0x00007fac68ef4927: jmpq 0x00007fac68ef4927 ; {runtime_call} >> [Exception Handler] >> 0x00007fac68ef492c: callq 0x00007fac703da280 ; {runtime_call handle_exception_from_callee Runtime1 stub} >> 0x00007fac68ef4931: mov $0x7fac885d8067,%rdi ; {external_word} >> 0x00007fac68ef493b: and $0xfffffffffffffff0,%rsp >> 0x00007fac68ef493f: callq 0x00007fac881e9900 ; {runtime_call MacroAssembler::debug64(char*, long, long*)} >> 0x00007fac68ef4944: hlt >> [Deopt Handler Code] >> 0x00007fac68ef4945: mov $0x7fac68ef4945,%r10 ; {section_word} >> 0x00007fac68ef494f: push %r10 >> 0x00007fac68ef4951: jmpq 0x00007fac70326520 ; {runtime_call DeoptimizationBlob} >> 0x00007fac68ef4956: mov $0x0,%rbx ; {static_stub} // <---------- here >> 0x00007fac68ef4960: jmpq 0x00007fac68ef4960 ; {runtime_call} >> 0x00007fac68ef4965: mov $0x0,%rbx ; {static_stub} // <---------- here >> 0x00007fac68ef496f: jmpq 0x00007fac68ef496f ; {runtime_call} >> 0x00007fac68ef4974: mov $0x0,%rbx ; {static_stub} // <---------- here >> 0x00007fac68ef497e: jmpq 0x00007fac68ef497e ; {runtime_call} >> 0x00007fac68ef4983: hlt >> 0x00007fac68ef4984: hlt >> 0x00007fac68ef4985: hlt >> 0x00007fac68ef4986: hlt >> 0x00007fac68ef4987: hlt >> -------------------------------------------------------------------------------- >> [/Disassembly] >> >> >> >> It can be simply reproduced and dumped by `-XX:+PrintAssembly`. >> >> Though the correctness doesn't get affected in the current case, we may need to move them to a better place, back into the `[Stub Code]`, which might be more reasonable and unified. Also for the performance's sake, `ciEnv::register_method()`, where `code_buffer->finalize_stubs()` locates currently, has two locks `Compile_lock` and `MethodCompileQueue_lock`. 
So I think it may be better to move `code_buffer->finalize_stubs()` out to C1 and C2 code generation phases, separately, before the exception handler code is emitted so they are inside the `[Stub Code]` range. >> >> BTW, this is the "direct cause" of [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) because shared trampolines and their data are generated at the end of compiled code, which is different from the original condition. Though for that issue, the root cause is still from the Binutils, for even if trampolines are generated at the end of code, we should not fail as well when disassembling. But that is another issue, please see [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) for more details. >> >> Tested x86, AArch64, and RISC-V hotspot tier1~4 with fastdebug build, no new errors found. >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Andrew's review comments: another cleanup LGTM. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13071#pullrequestreview-1349793484 From thartmann at openjdk.org Tue Mar 21 08:02:49 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 21 Mar 2023 08:02:49 GMT Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible [v2] In-Reply-To: References: Message-ID: On Mon, 20 Mar 2023 13:58:52 GMT, Emanuel Peter wrote: >> I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand. >> >> Some of them seem to be justified. There I added comments why they are justified. They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter. >> >> For the following bailouts I did not add an assert, because it may have revealed a bug: >> [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA >> >> Note: >> [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj >> That bug bug was the reason for this RFE here. I added the assert for "malformed control flow". After this RFE here, that Bug will run into the assert on debug builds. >> >> I ran `tier1-6` and stress testing. Now running `tier7-9`. >> >> Filed a follow-up RFE to do the same for `BAILOUT` in `C1`: [JDK-8304532](https://bugs.openjdk.org/browse/JDK-8304532). > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > addressing Vladimir K's review suggestions Looks good overall, some comments below. src/hotspot/share/opto/matcher.cpp line 1256: > 1254: out_arg_limit_per_call = OptoReg::add(warped,1); > 1255: if (!RegMask::can_represent_arg(warped)) { > 1256: // Bailout. For example not enoug space on stack for all arguments. Happens for methods with too many arguments. Suggestion: // Bailout. For example not enough space on stack for all arguments. Happens for methods with too many arguments. src/hotspot/share/opto/parse1.cpp line 211: > 209: // of loops in catch blocks or loops which branch with a non-empty stack. > 210: if (sp() != 0) { > 211: // Bailout. But we should probably kick into normal compilation? We shouldn't add a question which is equivalent to a ToDo (same below). 
The comment should explain how this could happen and if we think that making the method not compilable is too strong, we should file a follow-up issue to investigate/fix. How common is this? We will still compile at C1, so normal compilation **will** kick in, right? src/hotspot/share/opto/parse1.cpp line 218: > 216: if (osr_block->has_trap_at(osr_block->start())) { > 217: assert(false, "OSR starts with an immediate trap"); > 218: // Bailout. But we should probably kick into normal compilation? "OSR inside finally clauses" sounds like it could easily happen. src/hotspot/share/opto/parse1.cpp line 436: > 434: _flow = method()->get_flow_analysis(); > 435: if (_flow->failing()) { > 436: assert(false, "flow fails during parsing"); Suggestion: assert(false, "type flow failed during parsing"); src/hotspot/share/opto/parse1.cpp line 515: > 513: _flow = method()->get_osr_flow_analysis(osr_bci()); > 514: if (_flow->failing()) { > 515: assert(false, "OSR flow failure"); Suggestion: assert(false, "type flow analysis failed for OSR compilation"); ------------- PR Review: https://git.openjdk.org/jdk/pull/13038#pullrequestreview-1349839442 PR Review Comment: https://git.openjdk.org/jdk/pull/13038#discussion_r1142981232 PR Review Comment: https://git.openjdk.org/jdk/pull/13038#discussion_r1142999117 PR Review Comment: https://git.openjdk.org/jdk/pull/13038#discussion_r1142994434 PR Review Comment: https://git.openjdk.org/jdk/pull/13038#discussion_r1142982576 PR Review Comment: https://git.openjdk.org/jdk/pull/13038#discussion_r1142983628 From adinn at openjdk.org Tue Mar 21 10:50:49 2023 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 21 Mar 2023 10:50:49 GMT Subject: RFR: 8304387: Fix positions of shared static stubs / trampolines [v2] In-Reply-To: <1-rKX7bks0rCi-7k-IHi07-Abg53c8EAt9KCYjgv66E=.14b278ee-9e72-4ac3-abd2-d7ba4b71f397@github.com> References: <1-rKX7bks0rCi-7k-IHi07-Abg53c8EAt9KCYjgv66E=.14b278ee-9e72-4ac3-abd2-d7ba4b71f397@github.com> Message-ID: <2HS0SdgiY4dfhKhs1WZI9WgnRb4Vp94rqI8zrRB0I70=.b4233605-a482-4142-91cb-a43983cfc00d@github.com> On Mon, 20 Mar 2023 12:08:16 GMT, Xiaolin Zheng wrote: >> This RFE fixes the positions of shared static stubs / trampolines. They should be like: >> >> >> [Verified Entry Point] >> ... >> ... >> ... >> [Stub Code] >> >> [Exception Handler] >> ... >> ... >> [Deopt Handler Code] >> ... >> ... >> >> >> Currently after we have the shared static stubs/trampolines in JDK-8280481 and JDK-8280152 : >> >> >> [Verified Entry Point] >> ... >> ... >> ... >> [Stub Code] >> >> [Exception Handler] >> ... >> ... >> [Deopt Handler Code] >> ... >> ... >> // they are presented in the Deopt range, though do not have correctness issues. >> >> >> For example on x86: >> >> >> [Verified Entry Point] >> ... 
>> [Stub Code] >> 0x00007fac68ef4908: nopl 0x0(%rax,%rax,1) ; {no_reloc} >> 0x00007fac68ef490d: mov $0x0,%rbx ; {static_stub} >> 0x00007fac68ef4917: jmpq 0x00007fac68ef4917 ; {runtime_call} >> 0x00007fac68ef491c: nop >> 0x00007fac68ef491d: mov $0x0,%rbx ; {static_stub} >> 0x00007fac68ef4927: jmpq 0x00007fac68ef4927 ; {runtime_call} >> [Exception Handler] >> 0x00007fac68ef492c: callq 0x00007fac703da280 ; {runtime_call handle_exception_from_callee Runtime1 stub} >> 0x00007fac68ef4931: mov $0x7fac885d8067,%rdi ; {external_word} >> 0x00007fac68ef493b: and $0xfffffffffffffff0,%rsp >> 0x00007fac68ef493f: callq 0x00007fac881e9900 ; {runtime_call MacroAssembler::debug64(char*, long, long*)} >> 0x00007fac68ef4944: hlt >> [Deopt Handler Code] >> 0x00007fac68ef4945: mov $0x7fac68ef4945,%r10 ; {section_word} >> 0x00007fac68ef494f: push %r10 >> 0x00007fac68ef4951: jmpq 0x00007fac70326520 ; {runtime_call DeoptimizationBlob} >> 0x00007fac68ef4956: mov $0x0,%rbx ; {static_stub} // <---------- here >> 0x00007fac68ef4960: jmpq 0x00007fac68ef4960 ; {runtime_call} >> 0x00007fac68ef4965: mov $0x0,%rbx ; {static_stub} // <---------- here >> 0x00007fac68ef496f: jmpq 0x00007fac68ef496f ; {runtime_call} >> 0x00007fac68ef4974: mov $0x0,%rbx ; {static_stub} // <---------- here >> 0x00007fac68ef497e: jmpq 0x00007fac68ef497e ; {runtime_call} >> 0x00007fac68ef4983: hlt >> 0x00007fac68ef4984: hlt >> 0x00007fac68ef4985: hlt >> 0x00007fac68ef4986: hlt >> 0x00007fac68ef4987: hlt >> -------------------------------------------------------------------------------- >> [/Disassembly] >> >> >> >> It can be simply reproduced and dumped by `-XX:+PrintAssembly`. >> >> Though the correctness doesn't get affected in the current case, we may need to move them to a better place, back into the `[Stub Code]`, which might be more reasonable and unified. Also for the performance's sake, `ciEnv::register_method()`, where `code_buffer->finalize_stubs()` locates currently, has two locks `Compile_lock` and `MethodCompileQueue_lock`. So I think it may be better to move `code_buffer->finalize_stubs()` out to C1 and C2 code generation phases, separately, before the exception handler code is emitted so they are inside the `[Stub Code]` range. >> >> BTW, this is the "direct cause" of [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) because shared trampolines and their data are generated at the end of compiled code, which is different from the original condition. Though for that issue, the root cause is still from the Binutils, for even if trampolines are generated at the end of code, we should not fail as well when disassembling. But that is another issue, please see [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) for more details. >> >> Tested x86, AArch64, and RISC-V hotspot tier1~4 with fastdebug build, no new errors found. >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Andrew's review comments: another cleanup @zhengxiaolinX Yes,still looks good. Please integrate and I will sponsor. 
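The ordering change described above amounts to finalizing the shared stubs before the exception and deopt handlers are emitted, so that they stay inside the [Stub Code] section. A rough sketch of that ordering on the C2 side (hypothetical code, not the actual patch; the real change touches both the C1 and C2 emission paths, and the handler-emission calls shown here are assumptions):

```
// Sketch: emit the shared static stubs / trampolines first ...
if (!cb->finalize_stubs()) {
  C->record_failure("CodeCache is full");
  return;
}
// ... and only then the exception and deopt handlers, so the shared stubs
// end up inside [Stub Code] instead of after [Deopt Handler Code].
_code_offsets.set_value(CodeOffsets::Exceptions, HandlerImpl::emit_exception_handler(*cb));
_code_offsets.set_value(CodeOffsets::Deopt, HandlerImpl::emit_deopt_handler(*cb));
```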
------------- PR Comment: https://git.openjdk.org/jdk/pull/13071#issuecomment-1477617372 From xlinzheng at openjdk.org Tue Mar 21 10:50:50 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Tue, 21 Mar 2023 10:50:50 GMT Subject: RFR: 8304387: Fix positions of shared static stubs / trampolines [v2] In-Reply-To: <2HS0SdgiY4dfhKhs1WZI9WgnRb4Vp94rqI8zrRB0I70=.b4233605-a482-4142-91cb-a43983cfc00d@github.com> References: <1-rKX7bks0rCi-7k-IHi07-Abg53c8EAt9KCYjgv66E=.14b278ee-9e72-4ac3-abd2-d7ba4b71f397@github.com> <2HS0SdgiY4dfhKhs1WZI9WgnRb4Vp94rqI8zrRB0I70=.b4233605-a482-4142-91cb-a43983cfc00d@github.com> Message-ID: On Tue, 21 Mar 2023 10:45:45 GMT, Andrew Dinn wrote: > @zhengxiaolinX Yes,still looks good. Please integrate and I will sponsor. Thanks a lot! @adinn @RealFYang ------------- PR Comment: https://git.openjdk.org/jdk/pull/13071#issuecomment-1477620264 From xlinzheng at openjdk.org Tue Mar 21 11:30:57 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Tue, 21 Mar 2023 11:30:57 GMT Subject: Integrated: 8304387: Fix positions of shared static stubs / trampolines In-Reply-To: References: Message-ID: On Fri, 17 Mar 2023 07:30:19 GMT, Xiaolin Zheng wrote: > This RFE fixes the positions of shared static stubs / trampolines. They should be like: > > > [Verified Entry Point] > ... > ... > ... > [Stub Code] > > [Exception Handler] > ... > ... > [Deopt Handler Code] > ... > ... > > > Currently after we have the shared static stubs/trampolines in JDK-8280481 and JDK-8280152 : > > > [Verified Entry Point] > ... > ... > ... > [Stub Code] > > [Exception Handler] > ... > ... > [Deopt Handler Code] > ... > ... > // they are presented in the Deopt range, though do not have correctness issues. > > > For example on x86: > > > [Verified Entry Point] > ... > [Stub Code] > 0x00007fac68ef4908: nopl 0x0(%rax,%rax,1) ; {no_reloc} > 0x00007fac68ef490d: mov $0x0,%rbx ; {static_stub} > 0x00007fac68ef4917: jmpq 0x00007fac68ef4917 ; {runtime_call} > 0x00007fac68ef491c: nop > 0x00007fac68ef491d: mov $0x0,%rbx ; {static_stub} > 0x00007fac68ef4927: jmpq 0x00007fac68ef4927 ; {runtime_call} > [Exception Handler] > 0x00007fac68ef492c: callq 0x00007fac703da280 ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x00007fac68ef4931: mov $0x7fac885d8067,%rdi ; {external_word} > 0x00007fac68ef493b: and $0xfffffffffffffff0,%rsp > 0x00007fac68ef493f: callq 0x00007fac881e9900 ; {runtime_call MacroAssembler::debug64(char*, long, long*)} > 0x00007fac68ef4944: hlt > [Deopt Handler Code] > 0x00007fac68ef4945: mov $0x7fac68ef4945,%r10 ; {section_word} > 0x00007fac68ef494f: push %r10 > 0x00007fac68ef4951: jmpq 0x00007fac70326520 ; {runtime_call DeoptimizationBlob} > 0x00007fac68ef4956: mov $0x0,%rbx ; {static_stub} // <---------- here > 0x00007fac68ef4960: jmpq 0x00007fac68ef4960 ; {runtime_call} > 0x00007fac68ef4965: mov $0x0,%rbx ; {static_stub} // <---------- here > 0x00007fac68ef496f: jmpq 0x00007fac68ef496f ; {runtime_call} > 0x00007fac68ef4974: mov $0x0,%rbx ; {static_stub} // <---------- here > 0x00007fac68ef497e: jmpq 0x00007fac68ef497e ; {runtime_call} > 0x00007fac68ef4983: hlt > 0x00007fac68ef4984: hlt > 0x00007fac68ef4985: hlt > 0x00007fac68ef4986: hlt > 0x00007fac68ef4987: hlt > -------------------------------------------------------------------------------- > [/Disassembly] > > > > It can be simply reproduced and dumped by `-XX:+PrintAssembly`. 
> > Though the correctness doesn't get affected in the current case, we may need to move them to a better place, back into the `[Stub Code]`, which might be more reasonable and unified. Also for the performance's sake, `ciEnv::register_method()`, where `code_buffer->finalize_stubs()` locates currently, has two locks `Compile_lock` and `MethodCompileQueue_lock`. So I think it may be better to move `code_buffer->finalize_stubs()` out to C1 and C2 code generation phases, separately, before the exception handler code is emitted so they are inside the `[Stub Code]` range. > > BTW, this is the "direct cause" of [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) because shared trampolines and their data are generated at the end of compiled code, which is different from the original condition. Though for that issue, the root cause is still from the Binutils, for even if trampolines are generated at the end of code, we should not fail as well when disassembling. But that is another issue, please see [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) for more details. > > Tested x86, AArch64, and RISC-V hotspot tier1~4 with fastdebug build, no new errors found. > > Thanks, > Xiaolin This pull request has now been integrated. Changeset: 1c04686c Author: Xiaolin Zheng Committer: Andrew Dinn URL: https://git.openjdk.org/jdk/commit/1c04686cd68a78f926f09707ac723aa762945527 Stats: 35 lines in 12 files changed: 12 ins; 15 del; 8 mod 8304387: Fix positions of shared static stubs / trampolines Reviewed-by: adinn, fyang ------------- PR: https://git.openjdk.org/jdk/pull/13071 From fyang at openjdk.org Tue Mar 21 11:50:41 2023 From: fyang at openjdk.org (Fei Yang) Date: Tue, 21 Mar 2023 11:50:41 GMT Subject: RFR: 8302384: Handle hsdis out-of-bound logic for RISC-V [v3] In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 03:37:39 GMT, Xiaolin Zheng wrote: >> Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove white spaces, and use hsdis code style > > Uh. Sorry for the disturbance; I forgot the rule was "1 review required, with at least 1 Reviewer" in the mainline. Seems we need to wait for another proper review of this. > > BTW, an issue [1] for binutils was filed yesterday to track this; though I have not got a confirmation about whether it could get fixed, and when. > > [1] https://sourceware.org/bugzilla/show_bug.cgi?id=30184 @zhengxiaolinX : Hi, I see the fix for binutils has been committed [1]. At the same time, as a workaround for this issue, fix for [2] was also merged. Do we still need this change then? Thanks. [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=e43d8768d909139bf5ec4a97c79a096ed28a4b08 [2] https://bugs.openjdk.org/browse/JDK-8304387 ------------- PR Comment: https://git.openjdk.org/jdk/pull/12551#issuecomment-1477696412 From xlinzheng at openjdk.org Tue Mar 21 12:39:03 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Tue, 21 Mar 2023 12:39:03 GMT Subject: RFR: 8302384: Handle hsdis out-of-bound logic for RISC-V [v3] In-Reply-To: References: Message-ID: On Thu, 2 Mar 2023 03:37:39 GMT, Xiaolin Zheng wrote: >> Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove white spaces, and use hsdis code style > > Uh. Sorry for the disturbance; I forgot the rule was "1 review required, with at least 1 Reviewer" in the mainline. Seems we need to wait for another proper review of this. 
> > BTW, an issue [1] for binutils was filed yesterday to track this; though I have not got a confirmation about whether it could get fixed, and when. > > [1] https://sourceware.org/bugzilla/show_bug.cgi?id=30184 > @zhengxiaolinX : Hi, I see the fix for binutils has been committed [1]. At the same time, as a workaround for this issue, fix for [2] was also merged. Do we still need this change then? Thanks. > > [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=e43d8768d909139bf5ec4a97c79a096ed28a4b08 [2] https://bugs.openjdk.org/browse/JDK-8304387 Interestingly both are at today. Although the Binutils fix seems in the range of Binutils 2.41 and Hsdis supports 2.38- currently. But with #13071 we can fortunately "bypass" this issue. Maybe in the future we still have small chances to trigger this issue if changes are made and some interesting data is stored at the end of a piece of compiled code again - but I am okay with retracting this patch, for after #13071 code in this patch can neither be triggered nor be tested, and I do not really like the platform-specific workaround as well. I think we could put off such issue until then. Again thanks for approving this patch! @luhenry ------------- PR Comment: https://git.openjdk.org/jdk/pull/12551#issuecomment-1477764218 From xlinzheng at openjdk.org Tue Mar 21 12:39:05 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Tue, 21 Mar 2023 12:39:05 GMT Subject: Withdrawn: 8302384: Handle hsdis out-of-bound logic for RISC-V In-Reply-To: References: Message-ID: <6G9Dw6PsZu-sJD16oNuBMuI2_4qIG88XffR7-L-Xl0U=.ede32b9f-5fdc-48fa-87da-d2c8f81e0559@github.com> On Tue, 14 Feb 2023 07:10:42 GMT, Xiaolin Zheng wrote: > Several debug assertion failures have been observed on RISC-V, on physical boards only. > > Failure list: (the `hs_err` log is in the JBS issue) > > compiler/vectorapi/TestVectorShiftImm.java > compiler/compilercontrol/jcmd/AddPrintAssemblyTest.java > compiler/intrinsics/math/TestFpMinMaxIntrinsics.java > compiler/compilercontrol/TestCompilerDirectivesCompatibilityFlag.java > compiler/compilercontrol/TestCompilerDirectivesCompatibilityCommandOn.java > compiler/runtime/TestConstantsInError.java > compiler/compilercontrol/jcmd/PrintDirectivesTest.java > > > When the failure occurs, hsdis is disassembling the last unrecognizable data at the end of a code blob, usually the data stored in trampolines. It could be theoretically any address inside the code cache, and sometimes binutils can recognize the data as 2-byte instructions, 4-byte instructions, and 6 or 8-byte instructions even though as far as I know no instructions longer than 4-byte have landed. Therefore, binutils may firstly run out of bound after the calculation. However, the RISC-V binutils returns our `hsdis_read_memory_func`'s return number directly [1] (an EIO, which is `5`, FYI), rather than returning a `-1` (FYI, [2][3][4][5]) on other platforms when such out-of-bound happens. So when coming back to our hsdis, we (hsdis) get the `size = 5` as the return value [6] rather than `-1`: our hsdis error handling is skipped, our variable `p` is out of bound, and then we meet the crash. > > To fix it, we should check the value is the special `EIO` on RISC-V. However, after fixing that issue, I found binutils would print some messages like "Address 0x%s is out of bounds." 
on the screen: > > > 0x0000003f901a41b4: auipc t0,0x0 ; {trampoline_stub} > 0x0000003f901a41b8: ld t0,12(t0) # 0x0000003f901a41c0 > 0x0000003f901a41bc: jr t0 > 0x0000003f901a41c0: .2byte 0x8ec0 > 0x0000003f901a41c2: srli s0,s0,0x21 > 0x0000003f901a41c4: Address 0x0000003f901a41c9 is out of bounds. <----------- But we want the real bytes here. > > > So, we should overwrite the `disassemble_info.memory_error_func` in the binutils callback [7], to generate our own output: > > 0x0000003f901a41b4: auipc t0,0x0 ; {trampoline_stub} > 0x0000003f901a41b8: ld t0,12(t0) # 0x0000003f901a41c0 > 0x0000003f901a41bc: jr t0 > 0x0000003f901a41c0: .2byte 0x8ec0 > 0x0000003f901a41c2: srli s0,s0,0x21 > 0x0000003f901a41c4: .4byte 0x0000003f > > > Mirroring the code of hsdis-llvm, to print merely a 4-byte data [8]. > > > BTW, the reason why the crash only happens on the physical board, is that boards support RISC-V sv39 address mode only: a legal user-space address can be no more than 38-bit. So the code cache is always mmapped to an address like `0x3fe0000000`. Such a `0x3f` is always recognized as the mark of an 8-byte instruction [9]. > > > Tested hotspot tier1~4 with fastdebug build, no new errors found. > > Thanks, > Xiaolin > > > [1] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/riscv-dis.c#L940 > [2] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/aarch64-dis.c#L3792 > [3] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/ppc-dis.c#L872 > [4] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/s390-dis.c#L305 > [5] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/i386-dis.c#L9466 (the i386 one uses a `setlongjmp` to handle the exception case, so the code might look different) > [6] https://github.com/openjdk/jdk/blob/94e7cc8587356988e713d23d1653bdd5c43fb3f1/src/utils/hsdis/binutils/hsdis-binutils.c#L198 > [7] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/opcodes/dis-buf.c#L51-L72 > [8] https://github.com/openjdk/jdk/blob/94e7cc8587356988e713d23d1653bdd5c43fb3f1/src/utils/hsdis/llvm/hsdis-llvm.cpp#L316-L317 > [9] https://github.com/bminor/binutils-gdb/blob/binutils-2_38-branch/include/opcode/riscv.h#L30-L42 This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/12551 From psandoz at openjdk.org Tue Mar 21 16:32:44 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Tue, 21 Mar 2023 16:32:44 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v2] In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 16:11:50 GMT, Quan Anh Mai wrote: > Apart from the mask implementation, shuffle implementation definitely has to take into consideration the element type. Yes, the way you have implemented shuffle is tightly connected, that looks ok. I am wondering if we can make the mask implementation more loosely coupled and modified such that it does not have to take into consideration the element type (or species) of the vector it operates on, and instead compatibility is based solely on the lane count. Ideally it would be good to change the `VectorMask::check` method to just compare the lanes counts and not require a cast in the implementation, which i presume requires some deeper changes in C2? What you propose seems a possible a interim step towards a more preferable API, if the performance is good. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1478175761 From jcking at openjdk.org Tue Mar 21 17:01:08 2023 From: jcking at openjdk.org (Justin King) Date: Tue, 21 Mar 2023 17:01:08 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag Message-ID: Add missing `FREE_C_HEAP_ARRAY` call. ------------- Commit messages: - Fix memory leak in DirectivesParser::set_option_flag Changes: https://git.openjdk.org/jdk/pull/13125/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13125&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304684 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13125/head:pull/13125 PR: https://git.openjdk.org/jdk/pull/13125 From qamai at openjdk.org Tue Mar 21 18:15:44 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 21 Mar 2023 18:15:44 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v2] In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 16:29:44 GMT, Paul Sandoz wrote: >> I have moved most of the methods to `AbstractVector` and `AbstractShuffle`, I have to resort to raw types, though, since there seems to be no way to do the same with wild cards, and the generics mechanism is not powerful enough for things like `Vector`. The remaining failure seems to be related to [JDK-8304676](https://bugs.openjdk.org/projects/JDK/issues/JDK-8304676), so I think this patch is ready for review now. >> >>> The mask implementation is specialized by the species of vectors it operates on, but does it have to be >> >> Apart from the mask implementation, shuffle implementation definitely has to take into consideration the element type. However, this information does not have to be visible to the API, similar to how we currently handle the vector length, we can have `class AbstractMask implements VectorMask`. As a result, the cast method would be useless and can be removed in the API, but our implementation details would still use it, for example >> >> Vector blend(Vector v, VectorMask w) { >> AbstractMask aw = (AbstractMask) w; >> AbstractMask tw = aw.cast(vspecies()); >> return VectorSupport.blend(...); >> } >> >> Vector rearrange(VectorShuffle s) { >> AbstractShuffle as = (AbstractShuffle) s; >> AbstractShuffle ts = s.cast(vspecies()); >> return VectorSupport.rearrangeOp(...); >> } >> >> What do you think? > >> Apart from the mask implementation, shuffle implementation definitely has to take into consideration the element type. > > Yes, the way you have implemented shuffle is tightly connected, that looks ok. > > I am wondering if we can make the mask implementation more loosely coupled and modified such that it does not have to take into consideration the element type (or species) of the vector it operates on, and instead compatibility is based solely on the lane count. > > Ideally it would be good to change the `VectorMask::check` method to just compare the lanes counts and not require a cast in the implementation, which i presume requires some deeper changes in C2? > > What you propose seems a possible a interim step towards a more preferable API, if the performance is good. @PaulSandoz As some hardware does differentiate masks based on element type, at some point we have to differentiate between them. From a design point of view, they are both implementation details so there might be no consideration regarding the API. 
On the other hand, having more on the Java side seems more desirable, as it illustrates the operations more intuitively compared to the graph management in C2. Another important point I can think of is that having a constant shape for a Java class would help us in implementing the vector calling convention, as we can rely on the class information instead of some side channels. As a result, I think I do prefer the current class hierarchy.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1478374992

From kvn at openjdk.org Tue Mar 21 18:18:46 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 21 Mar 2023 18:18:46 GMT
Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag
In-Reply-To: 
References: 
Message-ID: 

On Tue, 21 Mar 2023 16:53:18 GMT, Justin King wrote:

> Add missing `FREE_C_HEAP_ARRAY` call.

src/hotspot/share/compiler/directivesParser.cpp line 350:

> 348: set->set_ideal_phase_mask(mask);
> 349: }
> 350: FREE_C_HEAP_ARRAY(char, s);

Another way to do this is to create a local variable `bool valid = true;` and assign `false` to it instead of `return false;` on all these branches, so that we always reach this line to free the array and `return valid;`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1143822498

From duke at openjdk.org Tue Mar 21 18:54:57 2023
From: duke at openjdk.org (duke)
Date: Tue, 21 Mar 2023 18:54:57 GMT
Subject: Withdrawn: 8292761: x86: Clone nodes to match complex rules
In-Reply-To: 
References: 
Message-ID: 

On Tue, 23 Aug 2022 09:07:54 GMT, Quan Anh Mai wrote:

> Hi,
>
> This patch tries to clone a node if it can be matched as a part of a BMI and lea pattern. This may reduce the live range of a local or remove that local completely.
>
> Please take a look and have some reviews. Thanks a lot.

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/9977

From jcking at openjdk.org Tue Mar 21 20:34:17 2023
From: jcking at openjdk.org (Justin King)
Date: Tue, 21 Mar 2023 20:34:17 GMT
Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2]
In-Reply-To: 
References: 
Message-ID: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com>

> Add missing `FREE_C_HEAP_ARRAY` call.
Justin King has updated the pull request incrementally with one additional commit since the last revision: Update based on review Signed-off-by: Justin King ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13125/files - new: https://git.openjdk.org/jdk/pull/13125/files/e5a92853..a832d587 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13125&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13125&range=00-01 Stats: 19 lines in 1 file changed: 9 ins; 5 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/13125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13125/head:pull/13125 PR: https://git.openjdk.org/jdk/pull/13125 From jcking at openjdk.org Tue Mar 21 20:37:45 2023 From: jcking at openjdk.org (Justin King) Date: Tue, 21 Mar 2023 20:37:45 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 18:16:10 GMT, Vladimir Kozlov wrote: >> Justin King has updated the pull request incrementally with one additional commit since the last revision: >> >> Update based on review >> >> Signed-off-by: Justin King > > src/hotspot/share/compiler/directivesParser.cpp line 350: > >> 348: set->set_ideal_phase_mask(mask); >> 349: } >> 350: FREE_C_HEAP_ARRAY(char, s); > > An other way to do this is create local variable `bool valid = true;` and assign `false` to instead of `return false;` on all these branches so that we always reach this line to free array and `return valid;`. I was going for the path of least change, but I can do that. Done. I am only doing that for string, as I want to deal with only the leak in this change. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1143961249 From kvn at openjdk.org Tue Mar 21 21:33:43 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 21 Mar 2023 21:33:43 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> Message-ID: On Tue, 21 Mar 2023 20:34:17 GMT, Justin King wrote: >> Add missing `FREE_C_HEAP_ARRAY` call. > > Justin King has updated the pull request incrementally with one additional commit since the last revision: > > Update based on review > > Signed-off-by: Justin King Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13125#pullrequestreview-1351412240 From dlong at openjdk.org Tue Mar 21 22:32:43 2023 From: dlong at openjdk.org (Dean Long) Date: Tue, 21 Mar 2023 22:32:43 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> Message-ID: On Tue, 21 Mar 2023 20:34:17 GMT, Justin King wrote: >> Add missing `FREE_C_HEAP_ARRAY` call. > > Justin King has updated the pull request incrementally with one additional commit since the last revision: > > Update based on review > > Signed-off-by: Justin King Changes requested by dlong (Reviewer). src/hotspot/share/compiler/directivesParser.cpp line 351: > 349: } > 350: > 351: FREE_C_HEAP_ARRAY(char, s); This looks unsafe. 
We shouldn't free the memory without clearing all references to it, otherwise there is a dangling pointer. There is already another reference to the memory because of this call: `(set->*test)((void *)&s);` (see the set_function_definition macro) I think it would be better to move this copying call until after validation has been done. ------------- PR Review: https://git.openjdk.org/jdk/pull/13125#pullrequestreview-1351476419 PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1144052137 From jcking at openjdk.org Wed Mar 22 00:07:43 2023 From: jcking at openjdk.org (Justin King) Date: Wed, 22 Mar 2023 00:07:43 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> Message-ID: On Tue, 21 Mar 2023 22:30:03 GMT, Dean Long wrote: >> Justin King has updated the pull request incrementally with one additional commit since the last revision: >> >> Update based on review >> >> Signed-off-by: Justin King > > src/hotspot/share/compiler/directivesParser.cpp line 351: > >> 349: } >> 350: >> 351: FREE_C_HEAP_ARRAY(char, s); > > This looks unsafe. We shouldn't free the memory without clearing all references to it, otherwise there is a dangling pointer. There is already another reference to the memory because of this call: > > `(set->*test)((void *)&s);` (see the set_function_definition macro) > > I think it would be better to move this copying call until after validation has been done. I am very confused actually, `(set->*test)((void *)&s);` calls DirectiveSet::set_X. Looking at the code, it simply just stores the pointer? Is the DirectiveSet supposed to own the option? And if so, who is freeing it then? It doesn't look like cloning actually clones the underlying storage and the DirectiveSet destructor doesn't free it. So really DirectiveSet::~DirectiveSet should be freeing the string storage and DirectiveSet::set_X is taking ownership. Yeah? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1144102314 From thartmann at openjdk.org Wed Mar 22 06:42:42 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 22 Mar 2023 06:42:42 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> Message-ID: On Thu, 9 Mar 2023 01:19:40 GMT, Wang Haomin wrote: >> After https://bugs.openjdk.org/browse/JDK-8292289 , the base class of VectorTestNode changed from Node to CmpNode. So I add two match rule into ad file. >> >> match(If cop (VectorTest op1 op2)); >> match(Set dst (CMoveI (Binary cop (VectorTest op1 op2)) (Binary src1 src2))); >> >> First error, rule1 shouldn't generate the statement "node->_bottom_type = _leaf->bottom_type();". >> Second error, both rule1 and rule2 need to use VectorTestNode, the VectorTestNode should be cloned like CmpNode. > > Wang Haomin has updated the pull request incrementally with one additional commit since the last revision: > > compare the results with 0 Marked as reviewed by thartmann (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/12917#pullrequestreview-1351808655 From thartmann at openjdk.org Wed Mar 22 06:42:46 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 22 Mar 2023 06:42:46 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> <01AxKBXWxHUxCSIZhNJKwuLcg2KVSx9Jva1PTBL5ZxQ=.42c59b6c-efad-4558-8a7d-190e4505049a@github.com> Message-ID: On Mon, 20 Mar 2023 12:58:57 GMT, Wang Haomin wrote: >> src/hotspot/share/adlc/output_c.cpp line 3989: >> >>> 3987: if (inst->captures_bottom_type(_globalNames)) { >>> 3988: if (strncmp("MachCall", inst->mach_base_class(_globalNames), strlen("MachCall")) != 0 >>> 3989: && strncmp("MachIf", inst->mach_base_class(_globalNames), strlen("MachIf")) != 0) { >> >> Could you please explain this change, and how it relates to JDK-8292289, in more detail? > > Add `match(If cop (VectorTest op1 op2));` into ad file. > The following code will be generated in`MachNode *State::MachNodeGenerator(int opcode)` of ad_xxx_gen.cpp. > > > 5458 case anytrue_in_maskV16_branch_rule: { > 5459 anytrue_in_maskV16_branchNode *node = new anytrue_in_maskV16_branchNode(); > 5460 node->set_opnd_array(4, MachOperGenerator(LABEL)); > 5461 node->_bottom_type = _leaf->bottom_type(); > 5462 node->_prob = _leaf->as_If()->_prob; > 5463 node->_fcnt = _leaf->as_If()->_fcnt; > 5464 return node; > 5465 } > > > `error: 'class anytrue_in_maskV16_branchNode' has no member named '_bottom_type';` reported when `make hotspot`. > In fact, `MachIfNode` does not have the member `_bottom_type`. > > Normal `MachIfNode` will return false in `inst->captures_bottom_type`. However, the right child node of this `MachIfNode` is `VectorTest`. `is_vector` return true, so `inst->captures_bottom_type` return true. Therefore the error occurred. Thanks for the explanation. The change looks good to me. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12917#discussion_r1144292178 From wanghaomin at openjdk.org Wed Mar 22 07:06:43 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 22 Mar 2023 07:06:43 GMT Subject: RFR: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest [v2] In-Reply-To: References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> Message-ID: On Wed, 22 Mar 2023 06:39:43 GMT, Tobias Hartmann wrote: >> Wang Haomin has updated the pull request incrementally with one additional commit since the last revision: >> >> compare the results with 0 > > Marked as reviewed by thartmann (Reviewer). @TobiHartmann Thanks for your review. Could you sponsor for it? 
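For context, a hedged sketch of the kind of Vector API code that can produce the `If (VectorTest ...)` shape these match rules cover; the class and method names are made up, and whether the branch actually lowers to a VectorTest depends on the platform and flags:

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class AnyTrueBranch {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

        // Branching on VectorMask.anyTrue() is the kind of code that can lower to an
        // If fed by a VectorTest node, which the new match rules are meant to handle.
        static int blocksWithPositives(int[] a) {
            int blocks = 0;
            int upper = SPECIES.loopBound(a.length);
            for (int i = 0; i < upper; i += SPECIES.length()) {
                IntVector v = IntVector.fromArray(SPECIES, a, i);
                VectorMask<Integer> m = v.compare(VectorOperators.GT, 0);
                if (m.anyTrue()) {
                    blocks++;
                }
            }
            return blocks;
        }

        public static void main(String[] args) {
            int[] data = new int[64];
            data[5] = 42;
            System.out.println(blocksWithPositives(data));
        }
    }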
------------- PR Comment: https://git.openjdk.org/jdk/pull/12917#issuecomment-1479015798 From wanghaomin at openjdk.org Wed Mar 22 07:41:50 2023 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 22 Mar 2023 07:41:50 GMT Subject: Integrated: 8303804: Fix some errors of If-VectorTest and CMove-VectorTest In-Reply-To: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> References: <-Jw_zF5ca_3WHcoZQwzsT6lMA1NFdAzbOv3063qU6Lw=.b90c971a-bfe4-4144-93dc-b04a4c89a154@github.com> Message-ID: <7QKCEf4phlveJwMtI98risKpS8HeyCkzIBOBIHB-Ym8=.cdf93dac-04bb-498b-b3e1-266ae6351998@github.com> On Wed, 8 Mar 2023 03:52:33 GMT, Wang Haomin wrote: > After https://bugs.openjdk.org/browse/JDK-8292289 , the base class of VectorTestNode changed from Node to CmpNode. So I add two match rule into ad file. > > match(If cop (VectorTest op1 op2)); > match(Set dst (CMoveI (Binary cop (VectorTest op1 op2)) (Binary src1 src2))); > > First error, rule1 shouldn't generate the statement "node->_bottom_type = _leaf->bottom_type();". > Second error, both rule1 and rule2 need to use VectorTestNode, the VectorTestNode should be cloned like CmpNode. This pull request has now been integrated. Changeset: c039d266 Author: Wang Haomin Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/c039d26603e85ae37b0a53430a47f5751bf911af Stats: 4 lines in 2 files changed: 2 ins; 0 del; 2 mod 8303804: Fix some errors of If-VectorTest and CMove-VectorTest Reviewed-by: qamai, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12917 From xgong at openjdk.org Wed Mar 22 08:08:46 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 22 Mar 2023 08:08:46 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v3] In-Reply-To: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: On Tue, 21 Mar 2023 16:16:31 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. 
>> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: > > - missing casts > - clean up src/hotspot/cpu/aarch64/aarch64_vector.ad line 6082: > 6080: // to implement rearrange. > 6081: > 6082: // Maybe move the shuffle preparation to VectorLoadShuffle Agree that moving the shuffle computation code to `VectorLoadShuffle`. Thanks! src/hotspot/share/opto/vectorIntrinsics.cpp line 2059: > 2057: if (need_load_shuffle) { > 2058: shuffle = gvn().transform(new VectorLoadShuffleNode(shuffle, vt)); > 2059: } How about generating `VectorLoadShuffleNode` for all platforms that support Vector API, and remove the helper method `vector_needs_load_shuffle()` ? For those platforms that do not need this shuffle preparation, we can emit nothing in codegen. src/hotspot/share/opto/vectorIntrinsics.cpp line 2426: > 2424: if (is_vector_shuffle(vbox_klass_from)) { > 2425: return false; // vector shuffles aren't supported > 2426: } Is it better to change this as an "assertion" or print the log details? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1144366812 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1144360349 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1144363416 From xgong at openjdk.org Wed Mar 22 08:11:43 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 22 Mar 2023 08:11:43 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v3] In-Reply-To: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: On Tue, 21 Mar 2023 16:16:31 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. 
This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. >> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: > > - missing casts > - clean up Please also update the copyright to 2023 for some touched files like `vectorSupport.hpp` and other java files like `AbstractShuffle.java`, `AbstractVector.java`, `VectorShape.java`, and `VectorShuffle.java`. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1479076010 From xgong at openjdk.org Wed Mar 22 08:33:47 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 22 Mar 2023 08:33:47 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v3] In-Reply-To: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: On Tue, 21 Mar 2023 16:16:31 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. 
On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. >> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: > > - missing casts > - clean up src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractShuffle.java line 118: > 116: return (VectorShuffle) v.rearrange(shuffle.cast(vspecies().asIntegral())) > 117: .toShuffle() > 118: .cast(vspecies()); Style issue. Suggest to change to: return (VectorShuffle) v.rearrange(shuffle.cast(vspecies().asIntegral())) .toShuffle() .cast(vspecies()); I also noticed that the similar shuffle cast code is used more frequently. Could we wrap such code `toShuffle().cast(vspecies())` to a separate method? src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractShuffle.java line 130: > 128: } else { > 129: v = v.blend(v.lanewise(VectorOperators.ADD, length()), > 130: v.compare(VectorOperators.LT, 0)); Style issue. Suggest to change to: v = v.blend(v.lanewise(VectorOperators.ADD, length()), v.compare(VectorOperators.LT, 0)); src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractVector.java line 198: > 196: if ((length() & (length() - 1)) != 0) { > 197: return wrap ? shuffleFromOp(i -> (VectorIntrinsics.wrapToRange(i * step + start, length()))) > 198: : shuffleFromOp(i -> i * step + start); Code style issue. Suggest to: return wrap ? shuffleFromOp(i -> (VectorIntrinsics.wrapToRange(i * step + start, length()))) : shuffleFromOp(i -> i * step + start); src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractVector.java line 204: > 202: Vector iota = species.iota(); > 203: iota = iota.lanewise(VectorOperators.MUL, step) > 204: .lanewise(VectorOperators.ADD, start); Style issue. Same as above. 
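As an aside, a sketch of the kind of helper suggested above, written against the public API rather than the internal `AbstractShuffle` code; the name `toShuffleOf` and the wrapper class are made up:

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.Vector;
    import jdk.incubator.vector.VectorShuffle;
    import jdk.incubator.vector.VectorSpecies;

    final class ShuffleHelpers {
        private ShuffleHelpers() {}

        // One place for the repeated "index vector -> shuffle of the target species"
        // conversion, instead of chaining toShuffle()/cast() at every use site.
        static <F> VectorShuffle<F> toShuffleOf(Vector<?> indices, VectorSpecies<F> species) {
            return indices.toShuffle().cast(species);
        }

        public static void main(String[] args) {
            VectorSpecies<Integer> ints = IntVector.SPECIES_128;
            VectorSpecies<Float> floats = FloatVector.SPECIES_128;
            IntVector indices = IntVector.fromArray(ints, new int[] {3, 2, 1, 0}, 0);
            VectorShuffle<Float> s = toShuffleOf(indices, floats);
            FloatVector v = FloatVector.fromArray(floats, new float[] {1f, 2f, 3f, 4f}, 0);
            System.out.println(v.rearrange(s)); // prints the reversed vector
        }
    }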
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1144384585 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1144389023 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1144390218 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1144390692 From tholenstein at openjdk.org Wed Mar 22 10:25:31 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 22 Mar 2023 10:25:31 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v4] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: fix callback for cloned models ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/fbabcdaa..6d837d9b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=02-03 Stats: 38 lines in 1 file changed: 19 ins; 16 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From rcastanedalo at openjdk.org Wed Mar 22 11:11:46 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 22 Mar 2023 11:11:46 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand Message-ID: Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. 
Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. ## Performance Benefits As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. ### Increased Auto-Vectorization Scope There are two main scenarios in which the proposed changeset enables further auto-vectorization: #### Reductions Using Global Accumulators public class Foo { int acc = 0; (..) void reduce(int[] array) { for (int i = 0; i < array.length; i++) { acc += array[i]; } } } Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) #### Reductions of partially unrolled loops (..) for (int i = 0; i < array.length / 2; i++) { acc += array[2*i]; acc += array[2*i + 1]; } (..) These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. ### Increased Performance of x64 Floating-Point `Math.min()/max()` Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). ## Implementation details The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. 
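As an aside, a hedged sketch of another shape mentioned in this RFR, a floating-point `Math.min()` reduction in a non-counted loop; the method and data below are made up and only illustrate the loop shape:

    public class NonCountedMinReduction {

        // The loop exit depends on the data, so this is not a counted loop, yet the
        // Math.min chain through 'min' still forms a reduction cycle on the loop phi.
        static float minOfLeadingNonNegatives(float[] a) {
            float min = Float.POSITIVE_INFINITY;
            int i = 0;
            while (i < a.length && a[i] >= 0.0f) {
                min = Math.min(min, a[i]);
                i++;
            }
            return min;
        }

        public static void main(String[] args) {
            float[] data = {5.0f, 2.0f, 7.0f, -1.0f, 0.5f};
            System.out.println(minOfLeadingNonNegatives(data)); // 2.0
        }
    }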
Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. ## Testing ### Functionality - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). - fuzzing (12 h. on linux-x64 and linux-aarch64). ##### TestGeneralizedReductions.java Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. ##### TestFpMinMaxReductions.java Tests the matching of floating-point max/min implementations in x64. ##### TestSuperwordFailsUnrolling.java This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. ### Performance #### General Benchmarks The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. #### Micro-benchmarks The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). ##### VectorReduction.java These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. ##### MaxIntrinsics.java This file is extended with four new micro-benchmarks. 
Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): | micro-benchmark | speedup compared to mainline | | --- | --- | | `fMinReduceInOuterLoop` | 1.1x | | `fMinReduceNonCounted` | 2.3x | | `fMinReduceGlobalAccumulator` | 2.4x | | `fMinReducePartiallyUnrolled` | 3.9x | ## Acknowledgments Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. ------------- Commit messages: - Do not run test in x86-32 - Update existing test instead of removing it - Add negative vectorization test - Update copyright headers - Add two more reduction vectorization microbenchmarks - Ensure that the reduction vectorization microbenchmarks actually vectorize - Add new FP min/max micro-benchmarks - Add one more FP min/max test case - Refine check conditions - Run min/max reduction tests only for UseAVX > 0 - ... and 13 more: https://git.openjdk.org/jdk/compare/7277bb19...72fe5a6a Changes: https://git.openjdk.org/jdk/pull/13120/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13120&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287087 Stats: 801 lines in 17 files changed: 639 ins; 99 del; 63 mod Patch: https://git.openjdk.org/jdk/pull/13120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13120/head:pull/13120 PR: https://git.openjdk.org/jdk/pull/13120 From tholenstein at openjdk.org Wed Mar 22 11:41:24 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 22 Mar 2023 11:41:24 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v5] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. 
Further the clone has the same filter profile selected when it gets opened > cloneTab Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: always apply filters in order ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/6d837d9b..1a87f5db Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=03-04 Stats: 46 lines in 7 files changed: 8 ins; 19 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From qamai at openjdk.org Wed Mar 22 12:46:33 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 22 Mar 2023 12:46:33 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v4] In-Reply-To: References: Message-ID: > Hi, > > This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: > > 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. > 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. > 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. > 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. > > Upon these changes, a `rearrange` can emit more efficient code: > > var species = IntVector.SPECIES_128; > var v1 = IntVector.fromArray(species, SRC1, 0); > var v2 = IntVector.fromArray(species, SRC2, 0); > v1.rearrange(v2.toShuffle()).intoArray(DST, 0); > > Before: > movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} > vmovdqu 0x10(%r10),%xmm2 > movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} > vmovdqu 0x10(%r10),%xmm0 > vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask > ; {external_word} > vpackusdw %xmm0,%xmm0,%xmm0 > vpackuswb %xmm0,%xmm0,%xmm0 > vpmovsxbd %xmm0,%xmm3 > vpcmpgtd %xmm3,%xmm1,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fc2acb4e0d8 > vpmovzxbd %xmm0,%xmm0 > vpermd %ymm2,%ymm0,%ymm0 > movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} > vmovdqu %xmm0,0x10(%r10) > > After: > movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} > vmovdqu 0x10(%r10),%xmm2 > vpxor %xmm0,%xmm0,%xmm0 > vpcmpgtd %xmm2,%xmm0,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fa818b27cb1 > vpermd %ymm1,%ymm2,%ymm0 > movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} > vmovdqu %xmm0,0x10(%r10) > > Please take a look and leave reviews. Thanks a lot. 
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: reviews ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13093/files - new: https://git.openjdk.org/jdk/pull/13093/files/4caa9d10..e0b9ee88 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13093&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13093&range=02-03 Stats: 17 lines in 5 files changed: 0 ins; 0 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/13093.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13093/head:pull/13093 PR: https://git.openjdk.org/jdk/pull/13093 From qamai at openjdk.org Wed Mar 22 12:46:36 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 22 Mar 2023 12:46:36 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v3] In-Reply-To: References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: On Wed, 22 Mar 2023 08:09:03 GMT, Xiaohong Gong wrote: >> Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: >> >> - missing casts >> - clean up > > Please also update the copyright to 2023 for some touched files like `vectorSupport.hpp` and other java files like `AbstractShuffle.java`, `AbstractVector.java`, `VectorShape.java`, and `VectorShuffle.java`. Thanks! @XiaohongGong Thanks, I have updated the copyright year and the code styles as you suggested ------------- PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1479500096 From qamai at openjdk.org Wed Mar 22 12:46:39 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 22 Mar 2023 12:46:39 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v4] In-Reply-To: References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: On Wed, 22 Mar 2023 07:59:40 GMT, Xiaohong Gong wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> reviews > > src/hotspot/share/opto/vectorIntrinsics.cpp line 2059: > >> 2057: if (need_load_shuffle) { >> 2058: shuffle = gvn().transform(new VectorLoadShuffleNode(shuffle, vt)); >> 2059: } > > How about generating `VectorLoadShuffleNode` for all platforms that support Vector API, and remove the helper method `vector_needs_load_shuffle()` ? For those platforms that do not need this shuffle preparation, we can emit nothing in codegen. I think not emitting `VectorLoadShuffleNode` is more common so it is better to emit them only when needed, as it will simplify the graph and may allow better inspections of the indices in the future. Additionally, a do-nothing node does not alias with its input and therefore kills the input, which leads to an additional spill if they both need to live. > src/hotspot/share/opto/vectorIntrinsics.cpp line 2426: > >> 2424: if (is_vector_shuffle(vbox_klass_from)) { >> 2425: return false; // vector shuffles aren't supported >> 2426: } > > Is it better to change this as an "assertion" or print the log details? 
With this change, C2 no longer differentiates a vector shuffle from a normal vector, so this check should be removed, as vector shuffles are converted to/from normal vectors using this routine.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1144748489
PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1144744663

From epeter at openjdk.org Wed Mar 22 12:58:21 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Wed, 22 Mar 2023 12:58:21 GMT
Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible [v3]
In-Reply-To: 
References: 
Message-ID: 

> I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand.
>
> Some of them seem to be justified. There I added comments why they are justified. They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter.
>
> For the following bailouts I did not add an assert, because it may have revealed a bug:
> [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA
>
> Note:
> [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj
> That bug was the reason for this RFE here. I added the assert for "malformed control flow".
After this RFE here, that Bug will run into the assert on debug builds. > > I ran `tier1-6` and stress testing. Now running `tier7-9`. > > Filed a follow-up RFE to do the same for `BAILOUT` in `C1`: [JDK-8304532](https://bugs.openjdk.org/browse/JDK-8304532). Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: remove some of my bad comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13038/files - new: https://git.openjdk.org/jdk/pull/13038/files/6e60f02e..9a40a1a9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13038&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13038&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13038.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13038/head:pull/13038 PR: https://git.openjdk.org/jdk/pull/13038 From epeter at openjdk.org Wed Mar 22 13:16:20 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 22 Mar 2023 13:16:20 GMT Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible [v2] In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 07:59:46 GMT, Tobias Hartmann wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> addressing Vladimir K's review suggestions > > src/hotspot/share/opto/parse1.cpp line 211: > >> 209: // of loops in catch blocks or loops which branch with a non-empty stack. >> 210: if (sp() != 0) { >> 211: // Bailout. But we should probably kick into normal compilation? > > We shouldn't add a question which is equivalent to a ToDo (same below). The comment should explain how this could happen and if we think that making the method not compilable is too strong, we should file a follow-up issue to investigate/fix. > > How common is this? We will still compile at C1, so normal compilation **will** kick in, right? Ok, I will remove he question. I think you are right, we should at least still have C1 compilation. In my experiments, it was extremely rare, this case. > src/hotspot/share/opto/parse1.cpp line 218: > >> 216: if (osr_block->has_trap_at(osr_block->start())) { >> 217: assert(false, "OSR starts with an immediate trap"); >> 218: // Bailout. But we should probably kick into normal compilation? > > "OSR inside finally clauses" sounds like it could easily happen. It sounds likely, but I never saw it happen, up to tier9. So I suggest we leave the assert in, and once it is triggered, we can decide to remove it again. But then we also have an example, and we can check it in as a test, to justify the bailout. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13038#discussion_r1144787843 PR Review Comment: https://git.openjdk.org/jdk/pull/13038#discussion_r1144786768 From jbhateja at openjdk.org Wed Mar 22 15:32:49 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 22 Mar 2023 15:32:49 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB In-Reply-To: <1xefg5Er866JRwRc53ioKjESRndo31dwTfM2oitZKQY=.1432e570-ce09-4a7a-bbbc-8deae75411cb@github.com> References: <1xefg5Er866JRwRc53ioKjESRndo31dwTfM2oitZKQY=.1432e570-ce09-4a7a-bbbc-8deae75411cb@github.com> Message-ID: On Tue, 21 Mar 2023 02:35:25 GMT, Quan Anh Mai wrote: > Yes x86 does not handle signed extension correctly. `pextrb` and `pextrw` zeroes the upper bits instead of signed extending them. A simple fix is to add `movsx` after those. 
> > >https://github.com/openjdk/jdk/blob/bbca7c3ede338a04d140abfe3e19cb27c628a0f5/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L2247 ![image](https://user-images.githubusercontent.com/59989778/226945245-85115014-9b1b-4cf0-b8e7-53b34d0e4f57.png) - As can be seen above C2 creates ExtractS node with TypeInt::SHORT which is then succeeded by ConvI2L node since expander returns a long value. - Return value is again down casted back short value though an explicit cast operation in java code. ![image](https://user-images.githubusercontent.com/59989778/226948646-7b123472-23c3-4b2d-9f53-a8a6f9534802.png) Currently C2 folds following IR sequence ExtractS -> ConvI2L -> ConvL2I -> LShift (16) -> RShift(16) ==> ExtractS -> return_value [Inline expander] [ Java side down cast long to short -> l2i + i2s] Because ideal type of ExtractS is TypeInt::SHORT hence entire chain of conversion operations are folded by compiler during idealizations. There are two ways to address this issue:- 1) We can set the ideal type of ExtractS node to TypeInt::INT as is done for ExtractB/UB since subsequent java code will handle down casting to generate correct result. This will also fix [JDK-8303508](https://bugs.openjdk.org/browse/JDK-8303508). 2) Enforce strict semantics of ExtractS/B IR node and thus backends should emit an explicit sign extension instruction movsx for sub-word types. I agree with @merykitty that 2) solution looks more robust since it will enforce accurate semantics of ExtractS IR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13070#issuecomment-1479779537 From epeter at openjdk.org Wed Mar 22 15:43:44 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 22 Mar 2023 15:43:44 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 14:49:26 GMT, Roberto Casta?eda Lozano wrote: > Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). > > This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: > > ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) > > The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. 
> > ## Performance Benefits > > As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. > > ### Increased Auto-Vectorization Scope > > There are two main scenarios in which the proposed changeset enables further auto-vectorization: > > #### Reductions Using Global Accumulators > > > public class Foo { > int acc = 0; > (..) > void reduce(int[] array) { > for (int i = 0; i < array.length; i++) { > acc += array[i]; > } > } > } > > Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: > > ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) > > #### Reductions of partially unrolled loops > > > (..) > for (int i = 0; i < array.length / 2; i++) { > acc += array[2*i]; > acc += array[2*i + 1]; > } > (..) > > > These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. > > ### Increased Performance of x64 Floating-Point `Math.min()/max()` > > Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). > > ## Implementation details > > The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). > > A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. 
> > The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. > > ## Testing > > ### Functionality > > - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). > - fuzzing (12 h. on linux-x64 and linux-aarch64). > > ##### TestGeneralizedReductions.java > > Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. > > ##### TestFpMinMaxReductions.java > > Tests the matching of floating-point max/min implementations in x64. > > ##### TestSuperwordFailsUnrolling.java > > This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. > > ### Performance > > #### General Benchmarks > > The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. > > #### Micro-benchmarks > > The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). > > > ##### VectorReduction.java > > These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. > > ##### MaxIntrinsics.java > > This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): > > | micro-benchmark | speedup compared to mainline | > | --- | --- | > | `fMinReduceInOuterLoop` | 1.1x | > | `fMinReduceNonCounted` | 2.3x | > | `fMinReduceGlobalAccumulator` | 2.4x | > | `fMinReducePartiallyUnrolled` | 3.9x | > > ## Acknowledgments > > Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. test/hotspot/jtreg/compiler/loopopts/superword/TestGeneralizedReductions.java line 82: > 80: > 81: @Test > 82: @IR(applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "true"}, Does `avx2` not imply `sse4.1`? 
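If the implication does hold on every CPU the test targets, the precondition could presumably be collapsed to a single feature check, along the lines of this sketch (the loop body and the counts clause are placeholders, not the constraints of the actual test):

    @Test
    @IR(applyIfCPUFeature = {"avx2", "true"},
        counts = {IRNode.ADD_REDUCTION_VI, "> 0"})
    static int sumSketch(int[] array) {
        int acc = 0;
        for (int i = 0; i < array.length; i++) {
            acc += array[i];
        }
        return acc;
    }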
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145020447 From epeter at openjdk.org Wed Mar 22 15:52:44 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 22 Mar 2023 15:52:44 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 14:49:26 GMT, Roberto Casta?eda Lozano wrote: > Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). > > This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: > > ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) > > The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. > > ## Performance Benefits > > As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. > > ### Increased Auto-Vectorization Scope > > There are two main scenarios in which the proposed changeset enables further auto-vectorization: > > #### Reductions Using Global Accumulators > > > public class Foo { > int acc = 0; > (..) > void reduce(int[] array) { > for (int i = 0; i < array.length; i++) { > acc += array[i]; > } > } > } > > Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: > > ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) > > #### Reductions of partially unrolled loops > > > (..) > for (int i = 0; i < array.length / 2; i++) { > acc += array[2*i]; > acc += array[2*i + 1]; > } > (..) > > > These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. 
> > ### Increased Performance of x64 Floating-Point `Math.min()/max()` > > Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). > > ## Implementation details > > The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). > > A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. > > The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. > > ## Testing > > ### Functionality > > - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). > - fuzzing (12 h. on linux-x64 and linux-aarch64). > > ##### TestGeneralizedReductions.java > > Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. > > ##### TestFpMinMaxReductions.java > > Tests the matching of floating-point max/min implementations in x64. > > ##### TestSuperwordFailsUnrolling.java > > This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. 
> > ### Performance > > #### General Benchmarks > > The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. > > #### Micro-benchmarks > > The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). > > > ##### VectorReduction.java > > These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. > > ##### MaxIntrinsics.java > > This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): > > | micro-benchmark | speedup compared to mainline | > | --- | --- | > | `fMinReduceInOuterLoop` | 1.1x | > | `fMinReduceNonCounted` | 2.3x | > | `fMinReduceGlobalAccumulator` | 2.4x | > | `fMinReducePartiallyUnrolled` | 3.9x | > > ## Acknowledgments > > Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. src/hotspot/share/opto/node.hpp line 586: > 584: } else { > 585: add_flag(Node::Flag_has_swapped_edges); > 586: } This is probably the riskiest part of the implementation. If anyone decides to swap edges without using `swap_edges`, we will not detect the reduction. We have to trust our tests to catch that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145036247 From epeter at openjdk.org Wed Mar 22 16:21:08 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 22 Mar 2023 16:21:08 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 14:49:26 GMT, Roberto Casta?eda Lozano wrote: > Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. 
Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). > > This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: > > ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) > > The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. > > ## Performance Benefits > > As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. > > ### Increased Auto-Vectorization Scope > > There are two main scenarios in which the proposed changeset enables further auto-vectorization: > > #### Reductions Using Global Accumulators > > > public class Foo { > int acc = 0; > (..) > void reduce(int[] array) { > for (int i = 0; i < array.length; i++) { > acc += array[i]; > } > } > } > > Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: > > ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) > > #### Reductions of partially unrolled loops > > > (..) > for (int i = 0; i < array.length / 2; i++) { > acc += array[2*i]; > acc += array[2*i + 1]; > } > (..) > > > These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. > > ### Increased Performance of x64 Floating-Point `Math.min()/max()` > > Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). > > ## Implementation details > > The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. 
Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). > > A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. > > The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. > > ## Testing > > ### Functionality > > - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). > - fuzzing (12 h. on linux-x64 and linux-aarch64). > > ##### TestGeneralizedReductions.java > > Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. > > ##### TestFpMinMaxReductions.java > > Tests the matching of floating-point max/min implementations in x64. > > ##### TestSuperwordFailsUnrolling.java > > This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. > > ### Performance > > #### General Benchmarks > > The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. > > #### Micro-benchmarks > > The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). > > > ##### VectorReduction.java > > These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. > > ##### MaxIntrinsics.java > > This file is extended with four new micro-benchmarks. 
Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): > > | micro-benchmark | speedup compared to mainline | > | --- | --- | > | `fMinReduceInOuterLoop` | 1.1x | > | `fMinReduceNonCounted` | 2.3x | > | `fMinReduceGlobalAccumulator` | 2.4x | > | `fMinReducePartiallyUnrolled` | 3.9x | > > ## Acknowledgments > > Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. src/hotspot/share/opto/superword.cpp line 446: > 444: int opc = n->Opcode(); > 445: return (opc != ReductionNode::opcode(opc, n->bottom_type()->basic_type()) > 446: || opc == Op_MinD || opc == Op_MinF || opc == Op_MaxD || opc == Op_MaxF); Why did you need to explicitly mention the `Op_MinD` etc? Are they covered in `ReductionNode::opcode`? src/hotspot/share/opto/superword.cpp line 481: > 479: > 480: _loop_reductions.clear(); > 481: const CountedLoopNode* loop_head = loop->_head->as_CountedLoop(); Can you not use the SuperWord member function `lp()`? src/hotspot/share/opto/superword.cpp line 523: > 521: Node* ctrl = _phase->get_ctrl(n); > 522: return (n->Opcode() == first->Opcode() && ctrl != nullptr && > 523: loop->is_member(_phase->get_loop(ctrl))); You could consider going the SuperWord way, and use `in_bb(n)` to check if `n` is in the SuperWord loop. src/hotspot/share/opto/superword.cpp line 543: > 541: for (DUIterator_Fast jmax, j = current->fast_outs(jmax); j < jmax; j++) { > 542: Node* u = current->fast_out(j); > 543: if (!loop->is_member(_phase->get_loop(_phase->ctrl_or_self(u)))) { same ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145057606 PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145087075 PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145079302 PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145089809 From epeter at openjdk.org Wed Mar 22 16:21:10 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 22 Mar 2023 16:21:10 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 16:16:03 GMT, Emanuel Peter wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). 
>> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). 
>> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. 
Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > src/hotspot/share/opto/superword.cpp line 481: > >> 479: >> 480: _loop_reductions.clear(); >> 481: const CountedLoopNode* loop_head = loop->_head->as_CountedLoop(); > > Can you not use the SuperWord member function `lp()`? I think you can drop the `loop` argument` to this function. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145091360 From epeter at openjdk.org Wed Mar 22 16:27:45 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 22 Mar 2023 16:27:45 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 14:49:26 GMT, Roberto Casta?eda Lozano wrote: > Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). > > This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: > > ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) > > The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. > > ## Performance Benefits > > As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. 
> > ### Increased Auto-Vectorization Scope > > There are two main scenarios in which the proposed changeset enables further auto-vectorization: > > #### Reductions Using Global Accumulators > > > public class Foo { > int acc = 0; > (..) > void reduce(int[] array) { > for (int i = 0; i < array.length; i++) { > acc += array[i]; > } > } > } > > Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: > > ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) > > #### Reductions of partially unrolled loops > > > (..) > for (int i = 0; i < array.length / 2; i++) { > acc += array[2*i]; > acc += array[2*i + 1]; > } > (..) > > > These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. > > ### Increased Performance of x64 Floating-Point `Math.min()/max()` > > Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). > > ## Implementation details > > The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). > > A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. > > The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. 
> > ## Testing > > ### Functionality > > - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). > - fuzzing (12 h. on linux-x64 and linux-aarch64). > > ##### TestGeneralizedReductions.java > > Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. > > ##### TestFpMinMaxReductions.java > > Tests the matching of floating-point max/min implementations in x64. > > ##### TestSuperwordFailsUnrolling.java > > This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. > > ### Performance > > #### General Benchmarks > > The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. > > #### Micro-benchmarks > > The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). > > > ##### VectorReduction.java > > These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. > > ##### MaxIntrinsics.java > > This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): > > | micro-benchmark | speedup compared to mainline | > | --- | --- | > | `fMinReduceInOuterLoop` | 1.1x | > | `fMinReduceNonCounted` | 2.3x | > | `fMinReduceGlobalAccumulator` | 2.4x | > | `fMinReducePartiallyUnrolled` | 3.9x | > > ## Acknowledgments > > Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. src/hotspot/share/opto/superword.cpp line 485: > 483: > 484: // Iterate through all phi nodes associated to the loop and search for > 485: // reduction cycles of at most LoopMaxUnroll nodes. `LoopMaxUnroll` is probably ok for most cases. With hand-unrolled `byte` loops, this may not work, since the loop will have more operations in the chain. You could consider using the number of nodes in the loop. 
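To make the concern concrete, a hand-unrolled byte reduction of the kind alluded to here already carries eight reduction operators per iteration before any automatic unrolling, so a search bound derived from LoopMaxUnroll alone might come up short (illustrative sketch only, not a test from this changeset):

    static byte reduceBytes(byte[] a) {
        byte acc = 0;
        for (int i = 0; i + 7 < a.length; i += 8) {
            acc += a[i];     acc += a[i + 1];
            acc += a[i + 2]; acc += a[i + 3];
            acc += a[i + 4]; acc += a[i + 5];
            acc += a[i + 6]; acc += a[i + 7];
        }
        return acc;
    }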
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145101297 From kvn at openjdk.org Wed Mar 22 18:39:08 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 22 Mar 2023 18:39:08 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 12:55:52 GMT, Quan Anh Mai wrote: > Hi, > > This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. > > Please take a look and leave some reviews. > Thanks a lot. This really should be reviewed by @jatin-bhateja and @sviswa7 Do the updated predicates satisfy the restrictions on these vectors listed in `match_rule_supported_vector` at line 1855? I can help with testing. @merykitty, what tests are affected by this change? Please, update to latest JDK sources. ------------- PR Review: https://git.openjdk.org/jdk/pull/13042#pullrequestreview-1353247207 PR Comment: https://git.openjdk.org/jdk/pull/13042#issuecomment-1480075471 From qamai at openjdk.org Wed Mar 22 20:22:22 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 22 Mar 2023 20:22:22 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v2] In-Reply-To: References: Message-ID: > Hi, > > This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. > > Please take a look and leave some reviews. > Thanks a lot. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into rearrangeI - improve rearrangeI ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13042/files - new: https://git.openjdk.org/jdk/pull/13042/files/3f3343ec..38f3450a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13042&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13042&range=00-01 Stats: 154245 lines in 1774 files changed: 110193 ins; 25740 del; 18312 mod Patch: https://git.openjdk.org/jdk/pull/13042.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13042/head:pull/13042 PR: https://git.openjdk.org/jdk/pull/13042 From qamai at openjdk.org Wed Mar 22 20:29:41 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 22 Mar 2023 20:29:41 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 18:36:21 GMT, Vladimir Kozlov wrote: >> Hi, >> >> This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. >> >> Please take a look and leave some reviews. >> Thanks a lot. > > Please, update to latest JDK sources.
@vnkozlov Currently `VectorRearrange` rejects vectors of length 2 as well as 256-bit vectors on AVX1. As a result, the only type of int/float vectors appearing in AVX1 has a length of 4. This change affects AVX >= 1 machines in int and float vector tests in `test/jdk/jdk/incubator/vector`. Thanks a lot. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13042#issuecomment-1480217453 From kvn at openjdk.org Wed Mar 22 21:41:42 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 22 Mar 2023 21:41:42 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 18:36:21 GMT, Vladimir Kozlov wrote: >> Hi, >> >> This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. >> >> Please take a look and leave some reviews. >> Thanks a lot. > > Please, update to latest JDK sources. > @vnkozlov Currently `VectorRearrange` rejects vectors of length 2 as well as 256-bit vectors on AVX1. As a result, the only type of int/float vectors appearing in AVX1 has a length of 4. This change affects AVX >= 1 machines in int and float vector tests in `test/jdk/jdk/incubator/vector`. Thanks a lot. Okay, then I will pay attention to vectors generation in `incubator/vector` testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13042#issuecomment-1480294635 From jbhateja at openjdk.org Thu Mar 23 00:47:06 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 23 Mar 2023 00:47:06 GMT Subject: RFR: 8303508: Vector.lane() gets wrong value on x86 Message-ID: Incorrectness happens because compiler absorbs casting IR chain during idealizations. Sign extending the result of sub-word extraction operation. Please refer to detailed discussion on https://github.com/openjdk/jdk/pull/13070#issuecomment-1479779537 Best Regards, Jatin ------------- Commit messages: - Removing redundant imports from test. - 8303508: Vector.lane() gets wrong value on x86 Changes: https://git.openjdk.org/jdk/pull/13152/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13152&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303508 Stats: 91 lines in 3 files changed: 87 ins; 3 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13152.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13152/head:pull/13152 PR: https://git.openjdk.org/jdk/pull/13152 From eliu at openjdk.org Thu Mar 23 01:30:01 2023 From: eliu at openjdk.org (Eric Liu) Date: Thu, 23 Mar 2023 01:30:01 GMT Subject: RFR: 8303508: Vector.lane() gets wrong value on x86 In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 00:37:29 GMT, Jatin Bhateja wrote: > Incorrectness happens because compiler absorbs casting IR chain during idealizations. > Sign extending the result of sub-word extraction operation. > > Please refer to detailed discussion on https://github.com/openjdk/jdk/pull/13070#issuecomment-1479779537 > > Best Regards, > Jatin Looks good to me. Thanks for your fix. ------------- Marked as reviewed by eliu (Committer). 
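For readers following JDK-8303508 above, a minimal sketch of the kind of pattern affected (assumed for illustration; the regression test added by the PR may look different) is a sub-word lane extraction whose value needs sign extension. If the backend does not sign-extend the extracted lane and C2 has folded away the Java-side narrowing cast, a negative short lane can surface as its zero-extended 32-bit counterpart:

    // Run with: java --add-modules jdk.incubator.vector LaneExtractSketch
    import jdk.incubator.vector.ShortVector;

    public class LaneExtractSketch {
        public static void main(String[] args) {
            short[] src = new short[ShortVector.SPECIES_128.length()];
            src[0] = (short) 0x8000;   // -32768, the most negative short
            var v = ShortVector.fromArray(ShortVector.SPECIES_128, src, 0);
            // Expected -32768 on all platforms; with the bug, the folded
            // extract could observe the raw bits 0x8000 (i.e. 32768) instead.
            System.out.println(v.lane(0));
        }
    }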
PR Review: https://git.openjdk.org/jdk/pull/13152#pullrequestreview-1353700842 From xgong at openjdk.org Thu Mar 23 01:45:20 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 23 Mar 2023 01:45:20 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v4] In-Reply-To: References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: On Wed, 22 Mar 2023 12:39:27 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 2426: >> >>> 2424: if (is_vector_shuffle(vbox_klass_from)) { >>> 2425: return false; // vector shuffles aren't supported >>> 2426: } >> >> Is it better to change this as an "assertion" or print the log details? > > The change indifferentiates a vector shuffle from a normal vector in C2, so this should be removed, as vector shuffles are converted to/from normal vector using this routine Oh, I'm sorry that I didn't notice it is a removing change. And, yes, you're right. So please ignore my previous comment. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1145580975 From xgong at openjdk.org Thu Mar 23 02:25:44 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 23 Mar 2023 02:25:44 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v4] In-Reply-To: References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: On Wed, 22 Mar 2023 12:42:15 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 2059: >> >>> 2057: if (need_load_shuffle) { >>> 2058: shuffle = gvn().transform(new VectorLoadShuffleNode(shuffle, vt)); >>> 2059: } >> >> How about generating `VectorLoadShuffleNode` for all platforms that support Vector API, and remove the helper method `vector_needs_load_shuffle()` ? For those platforms that do not need this shuffle preparation, we can emit nothing in codegen. > > I think not emitting `VectorLoadShuffleNode` is more common so it is better to emit them only when needed, as it will simplify the graph and may allow better inspections of the indices in the future. Additionally, a do-nothing node does not alias with its input and therefore kills the input, which leads to an additional spill if they both need to live. Yeah, I agree that saving a node have some benefits like what you said. My concern is there are more and more methods added into `Matcher::` and each platform has to do the different implementation. There is not too much meaning for those platforms that do not implement Vector API like` arm/ppc/...` for me. This makes code not so easy to maintain. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1145598639 From xgong at openjdk.org Thu Mar 23 02:31:46 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 23 Mar 2023 02:31:46 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v3] In-Reply-To: References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: <_oZaqVXsCPRK3Poj6yp7Vi4S4cCuRygQv9pGhPvTAPk=.d7139597-8b20-4170-80cc-94ec7f88b2e4@github.com> On Wed, 22 Mar 2023 08:09:03 GMT, Xiaohong Gong wrote: >> Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: >> >> - missing casts >> - clean up > > Please also update the copyright to 2023 for some touched files like `vectorSupport.hpp` and other java files like `AbstractShuffle.java`, `AbstractVector.java`, `VectorShape.java`, and `VectorShuffle.java`. Thanks! > @XiaohongGong Thanks, I have updated the copyright year and the code styles as you suggested Thanks for your update! > I also noticed that the similar shuffle cast code is used more frequently. Could we wrap such code `toShuffle().cast(vspecies())` to a separate method? And for this, is it possible to wrap such code in a single method? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1480494762 From thartmann at openjdk.org Thu Mar 23 07:18:42 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 23 Mar 2023 07:18:42 GMT Subject: RFR: 8303508: Vector.lane() gets wrong value on x86 In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 00:37:29 GMT, Jatin Bhateja wrote: > Incorrectness happens because compiler absorbs casting IR chain during idealizations. > Sign extending the result of sub-word extraction operation. > > Please refer to detailed discussion on https://github.com/openjdk/jdk/pull/13070#issuecomment-1479779537 > > Best Regards, > Jatin Looks good. I submitted some testing and will report back once it finished. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13152#pullrequestreview-1353976426 From thartmann at openjdk.org Thu Mar 23 07:19:47 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 23 Mar 2023 07:19:47 GMT Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible [v4] In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 13:16:20 GMT, Emanuel Peter wrote: >> I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand. >> >> Some of them seem to be justified. There I added comments why they are justified. They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter. >> >> For the following bailouts I did not add an assert, because it may have revealed a bug: >> [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA >> >> Note: >> [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj >> That bug bug was the reason for this RFE here. I added the assert for "malformed control flow". After this RFE here, that Bug will run into the assert on debug builds. >> >> I ran `tier1-6` and stress testing. Now running `tier7-9`. 
>> >> Filed a follow-up RFE to do the same for `BAILOUT` in `C1`: [JDK-8304532](https://bugs.openjdk.org/browse/JDK-8304532). > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > remove some of my bad comments Thanks for making these changes, looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13038#pullrequestreview-1353977437 From xgong at openjdk.org Thu Mar 23 07:34:49 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 23 Mar 2023 07:34:49 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v4] In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 12:46:33 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. >> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > reviews src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java line 868: > 866: return (Byte128Vector) Byte128Vector.VSPECIES.dummyVector() > 867: .vectorFactory(s.indices()); > 868: } Move the implementation details to the super class? 
src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteMaxVector.java line 862: > 860: v.convertShape(VectorOperators.B2I, species, 3) > 861: .reinterpretAsInts() > 862: .intoArray(a, offset + species.length() * 3); Can we add a method like `intoIntArray()` in `ByteVector` and move these common code there? The same to other vector types. src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteMaxVector.java line 919: > 917: } > 918: return false; > 919: } Same as `intoArray()`. Any possible that moving this function to `ByteVector` and rename it to something like `shuffleIndicesInRange` ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1145769798 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1145781044 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1145782910 From epeter at openjdk.org Thu Mar 23 07:47:55 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 23 Mar 2023 07:47:55 GMT Subject: Integrated: 8303951: Add asserts before record_method_not_compilable where possible In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 10:29:32 GMT, Emanuel Peter wrote: > I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand. > > Some of them seem to be justified. There I added comments why they are justified. They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter. > > For the following bailouts I did not add an assert, because it may have revealed a bug: > [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA > > Note: > [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj > That bug bug was the reason for this RFE here. I added the assert for "malformed control flow". After this RFE here, that Bug will run into the assert on debug builds. > > I ran `tier1-6` and stress testing. Now running `tier7-9`. > > Filed a follow-up RFE to do the same for `BAILOUT` in `C1`: [JDK-8304532](https://bugs.openjdk.org/browse/JDK-8304532). > > Note: the philosophy here is rather to have an assert too much that triggers in the future. Then we can re-evaluate and weaken or remove it - or maybe we find a bug that we can fix. This pull request has now been integrated. Changeset: af4d5600 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/af4d5600e37ec6d331e62c5d37491ee97cad5311 Stats: 58 lines in 10 files changed: 52 ins; 0 del; 6 mod 8303951: Add asserts before record_method_not_compilable where possible Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13038 From epeter at openjdk.org Thu Mar 23 07:47:54 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 23 Mar 2023 07:47:54 GMT Subject: RFR: 8303951: Add asserts before record_method_not_compilable where possible In-Reply-To: References: Message-ID: <0YGwhlhw8PsPHFtvbxd0VaoCh9i8Pkh42wZE3p_g7eQ=.53aa9fe6-5d90-4217-878f-c3dd0eb6951b@github.com> On Fri, 17 Mar 2023 16:06:47 GMT, Vladimir Kozlov wrote: >> I went through all `C2` bailouts, and checked if they are justified to bail out of compilation silently. I added asserts everywhere. Those that were hit, I inspected by hand. >> >> Some of them seem to be justified. There I added comments why they are justified. 
They are cases that we do not want to handle in `C2`, and that are rare enough so that it probably does not matter. >> >> For the following bailouts I did not add an assert, because it may have revealed a bug: >> [JDK-8304328](https://bugs.openjdk.org/browse/JDK-8304328) C2 Bailout "failed spill-split-recycle sanity check" reveals hidden issue with RA >> >> Note: >> [JDK-8303466](https://bugs.openjdk.org/browse/JDK-8303466) C2: COMPILE SKIPPED: malformed control flow - only one IfProj >> That bug bug was the reason for this RFE here. I added the assert for "malformed control flow". After this RFE here, that Bug will run into the assert on debug builds. >> >> I ran `tier1-6` and stress testing. Now running `tier7-9`. >> >> Filed a follow-up RFE to do the same for `BAILOUT` in `C1`: [JDK-8304532](https://bugs.openjdk.org/browse/JDK-8304532). >> >> Note: the philosophy here is rather to have an assert too much that triggers in the future. Then we can re-evaluate and weaken or remove it - or maybe we find a bug that we can fix. > >> Should we file a follow-up RFE to do the same for BAILOUT in C1? > > Yes @vnkozlov @TobiHartmann thanks for the help and reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13038#issuecomment-1480726431 From rcastanedalo at openjdk.org Thu Mar 23 08:38:48 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 23 Mar 2023 08:38:48 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: <6U2dbMPwbL91OYu0hccBV6grrp4jllvimeC7PqhVt_A=.f095ff26-ab95-417a-a4e9-3b7b9d102ddc@github.com> On Wed, 22 Mar 2023 15:49:55 GMT, Emanuel Peter wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. 
>> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. 
>> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > src/hotspot/share/opto/node.hpp line 586: > >> 584: } else { >> 585: add_flag(Node::Flag_has_swapped_edges); >> 586: } > > This is probably the riskiest part of the implementation. If anyone decides to swap edges without using `swap_edges`, we will not detect the reduction. We have to trust our tests to catch that. That's right, in that case we would miss the reduction. 
On the other hand, the assumption that edges are ordered in the current form is deeply ingrained in C2, and if this was altered, missing some reduction vectorizations would be the least of our problems. And, as you mention, thanks to the introduction of IR checks in the reduction tests we would detect the missing vectorization anyway. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145842532 From rcastanedalo at openjdk.org Thu Mar 23 09:03:44 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 23 Mar 2023 09:03:44 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 16:00:11 GMT, Emanuel Peter wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) 
>> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. 
>> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > src/hotspot/share/opto/superword.cpp line 446: > >> 444: int opc = n->Opcode(); >> 445: return (opc != ReductionNode::opcode(opc, n->bottom_type()->basic_type()) >> 446: || opc == Op_MinD || opc == Op_MinF || opc == Op_MaxD || opc == Op_MaxF); > > Why did you need to explicitly mention the `Op_MinD` etc? Are they covered in `ReductionNode::opcode`? Good catch, I just copied the code from the current reduction analysis: https://github.com/openjdk/jdk/blob/63d4afbeb17df4eff0f65041926373ee62a8a33a/src/hotspot/share/opto/loopTransform.cpp#L2520-L2522 But I agree, they seem redundant since they are covered in `ReductionNode::opcode`. Will test removing them. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145873823 From tholenstein at openjdk.org Thu Mar 23 09:43:26 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 23 Mar 2023 09:43:26 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v6] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). 
> > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: - always select previous profile for new tabs - .js ending for filters - save order of filters ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/1a87f5db..e1add0c6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=04-05 Stats: 160 lines in 7 files changed: 73 ins; 32 del; 55 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From rcastanedalo at openjdk.org Thu Mar 23 09:45:47 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 23 Mar 2023 09:45:47 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 16:18:20 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/superword.cpp line 481: >> >>> 479: >>> 480: _loop_reductions.clear(); >>> 481: const CountedLoopNode* loop_head = loop->_head->as_CountedLoop(); >> >> Can you not use the SuperWord member function `lp()`? > > I think you can drop the `loop` argument` to this function. Good suggestion, will do. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145923539 From rcastanedalo at openjdk.org Thu Mar 23 09:45:52 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 23 Mar 2023 09:45:52 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 16:11:49 GMT, Emanuel Peter wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). 
Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. 
A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. 
Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > src/hotspot/share/opto/superword.cpp line 523: > >> 521: Node* ctrl = _phase->get_ctrl(n); >> 522: return (n->Opcode() == first->Opcode() && ctrl != nullptr && >> 523: loop->is_member(_phase->get_loop(ctrl))); > > You could consider going the SuperWord way, and use `in_bb(n)` to check if `n` is in the SuperWord loop. Will do. > src/hotspot/share/opto/superword.cpp line 543: > >> 541: for (DUIterator_Fast jmax, j = current->fast_outs(jmax); j < jmax; j++) { >> 542: Node* u = current->fast_out(j); >> 543: if (!loop->is_member(_phase->get_loop(_phase->ctrl_or_self(u)))) { > > same Will do. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145923869 PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145924102 From tholenstein at openjdk.org Thu Mar 23 09:54:45 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 23 Mar 2023 09:54:45 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 12:37:06 GMT, Roberto Casta?eda Lozano wrote: > > I now added also a --Global-- profile that is selected by default. > > Thanks for the changes, Toby. I can see the `--Global--` profile selected by default, however as soon as I open a graph it switches to `--Local--`. Is this intended? No this was not intended. It should be fixed now. I also fixed that the order of the filters are taken into account when applied and when saved on closing IGV. The `--Global--` profile is still selected by default. But now when the user selects a different profile and then opens a new tab, that new tab has the same profile selected as the previous one. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12714#issuecomment-1480892110 From rcastanedalo at openjdk.org Thu Mar 23 09:59:43 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 23 Mar 2023 09:59:43 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 15:40:27 GMT, Emanuel Peter wrote: >> Reduction analysis finds cycles of reduction operations within loops. 
The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. 
To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. 
Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > test/hotspot/jtreg/compiler/loopopts/superword/TestGeneralizedReductions.java line 82: > >> 80: >> 81: @Test >> 82: @IR(applyIfCPUFeatureAnd = {"sse4.1", "true", "avx2", "true"}, > > Does `avx2` not imply `sse4.1`? Good catch, will remove `sse4.1`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145941313 From rcastanedalo at openjdk.org Thu Mar 23 10:09:44 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 23 Mar 2023 10:09:44 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 16:24:45 GMT, Emanuel Peter wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. 
The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. 
This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. 
> > src/hotspot/share/opto/superword.cpp line 485: > >> 483: >> 484: // Iterate through all phi nodes associated to the loop and search for >> 485: // reduction cycles of at most LoopMaxUnroll nodes. > > `LoopMaxUnroll` is probably ok for most cases. With hand-unrolled `byte` loops, this may not work, since the loop will have more operations in the chain. You could consider using the number of nodes in the loop. Thanks for the suggestion! I prefer to keep `LoopMaxUnroll` as a bound, to guard against pathologically large basic blocks, even if it implies missing some hand-unrolled loops. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145953197 From rcastanedalo at openjdk.org Thu Mar 23 10:12:43 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 23 Mar 2023 10:12:43 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: <1nILoG42Y7uYSLFdIC8FQvOpJ0IzhLjjtVrxOjrM_4w=.80aa8a03-e0c4-4bad-961a-d13d306db62a@github.com> On Tue, 21 Mar 2023 14:49:26 GMT, Roberto Casta?eda Lozano wrote: > Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). > > This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: > > ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) > > The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. > > ## Performance Benefits > > As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. > > ### Increased Auto-Vectorization Scope > > There are two main scenarios in which the proposed changeset enables further auto-vectorization: > > #### Reductions Using Global Accumulators > > > public class Foo { > int acc = 0; > (..) > void reduce(int[] array) { > for (int i = 0; i < array.length; i++) { > acc += array[i]; > } > } > } > > Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. 
However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: > > ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) > > #### Reductions of partially unrolled loops > > > (..) > for (int i = 0; i < array.length / 2; i++) { > acc += array[2*i]; > acc += array[2*i + 1]; > } > (..) > > > These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. > > ### Increased Performance of x64 Floating-Point `Math.min()/max()` > > Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). > > ## Implementation details > > The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). > > A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. > > The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. > > ## Testing > > ### Functionality > > - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). > - fuzzing (12 h. on linux-x64 and linux-aarch64). > > ##### TestGeneralizedReductions.java > > Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. 
`testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. > > ##### TestFpMinMaxReductions.java > > Tests the matching of floating-point max/min implementations in x64. > > ##### TestSuperwordFailsUnrolling.java > > This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. > > ### Performance > > #### General Benchmarks > > The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. > > #### Micro-benchmarks > > The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). > > > ##### VectorReduction.java > > These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. > > ##### MaxIntrinsics.java > > This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): > > | micro-benchmark | speedup compared to mainline | > | --- | --- | > | `fMinReduceInOuterLoop` | 1.1x | > | `fMinReduceNonCounted` | 2.3x | > | `fMinReduceGlobalAccumulator` | 2.4x | > | `fMinReducePartiallyUnrolled` | 3.9x | > > ## Acknowledgments > > Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. Thanks for your comments and suggestions @eme64! I addressed (or discussed) all of them, will do some testing before updating the changeset. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1480918361 From epeter at openjdk.org Thu Mar 23 10:15:45 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 23 Mar 2023 10:15:45 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 10:06:31 GMT, Roberto Casta?eda Lozano wrote: >> src/hotspot/share/opto/superword.cpp line 485: >> >>> 483: >>> 484: // Iterate through all phi nodes associated to the loop and search for >>> 485: // reduction cycles of at most LoopMaxUnroll nodes. >> >> `LoopMaxUnroll` is probably ok for most cases. 
With hand-unrolled `byte` loops, this may not work, since the loop will have more operations in the chain. You could consider using the number of nodes in the loop. > > Thanks for the suggestion! I prefer to keep `LoopMaxUnroll` as a bound, to guard against pathologically large basic blocks, even if it implies missing some hand-unrolled loops. Up to you. The loop body size is limited by `LoopUnrollLimit`, which is currently set to 50 or 60, depending on the platform. So we do not unroll too much to prevent pathologically large loop bodies. That is actually why most SuperWord tests have to increase `LoopUnrollLimit` - otherwise we would not even vectorize those cases. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1145960866 From epeter at openjdk.org Thu Mar 23 15:59:20 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 23 Mar 2023 15:59:20 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies Message-ID: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. Example: https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): 3.7 Scheduling Dependence analysis before packing ensures that statements within a group can be executed safely in parallel. However, it may be the case that executing two groups produces a dependence violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between groups if a statement in one group is dependent on a statement in the other. As long as there are no cycles in this dependence graph, all groups can be scheduled such that no violations occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group will need to be eliminated. Although experimental data has shown this case to be extremely rare, care must be taken to ensure correctness. **Solution** Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). **FYI** I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. 
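To make the schedulability check concrete, here is a small self-contained Java sketch of the topological-sort idea described above. It only illustrates the algorithm - the node numbering and edge list are made up, and the real `PacksetGraph` is built from packs, unpacked scalar nodes and `DepPreds` edges rather than plain integers:

```java
import java.util.*;

// Kahn-style topological sort over a small dependence graph: an edge
// {pred, succ} means 'pred' must be scheduled before 'succ'. If the sort
// cannot visit every node, the remaining nodes sit on a dependence cycle,
// which in SuperWord terms means at least one pack has to be removed.
class PacksetScheduleSketch {
    static boolean schedulable(int numNodes, List<int[]> edges) {
        List<List<Integer>> succs = new ArrayList<>();
        int[] inDegree = new int[numNodes];
        for (int i = 0; i < numNodes; i++) succs.add(new ArrayList<>());
        for (int[] e : edges) { succs.get(e[0]).add(e[1]); inDegree[e[1]]++; }
        Deque<Integer> worklist = new ArrayDeque<>();
        for (int i = 0; i < numNodes; i++) if (inDegree[i] == 0) worklist.add(i);
        int scheduled = 0;
        while (!worklist.isEmpty()) {
            int n = worklist.poll();
            scheduled++;
            for (int s : succs.get(n)) if (--inDegree[s] == 0) worklist.add(s);
        }
        return scheduled == numNodes; // false => a dependence cycle remains
    }

    public static void main(String[] args) {
        // Two packs that depend on each other form a cycle: not schedulable.
        System.out.println(schedulable(2, List.of(new int[]{0, 1}, new int[]{1, 0}))); // false
        // A simple chain 0 -> 1 -> 2 is schedulable.
        System.out.println(schedulable(3, List.of(new int[]{0, 1}, new int[]{1, 2}))); // true
    }
}
```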
------------- Commit messages: - higher requirements on tests, else they trigger the other bug - Merge branch 'master' into JDK-8304042 - fix IR rules - remove regression test because of another bug - allow reduction self-cycles - missing reference - only dump in debug - implemented the fix - 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies Changes: https://git.openjdk.org/jdk/pull/13078/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13078&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304042 Stats: 817 lines in 5 files changed: 817 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13078.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13078/head:pull/13078 PR: https://git.openjdk.org/jdk/pull/13078 From jbhateja at openjdk.org Thu Mar 23 17:01:27 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 23 Mar 2023 17:01:27 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v2] In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 20:22:22 GMT, Quan Anh Mai wrote: >> Hi, >> >> This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. >> >> Please take a look and leave some reviews. >> Thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into rearrangeI > - improve rearrangeI src/hotspot/cpu/x86/assembler_x86.cpp line 4230: > 4228: > 4229: void Assembler::vpermps(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { > 4230: assert(vector_len <= AVX_256bit ? VM_Version::supports_avx2() : VM_Version::supports_evex(), ""); Instruction support 256 and 512 bit vectors only and assertion should comply with that. src/hotspot/cpu/x86/x86.ad line 8645: > 8643: int vlen_enc = vector_length_encoding(this); > 8644: if (vlen_enc == Assembler::AVX_128bit) { > 8645: __ vpermilps($dst$$XMMRegister, $src$$XMMRegister, $shuffle$$XMMRegister, vlen_enc); Since you are emitting different instruction to save on domain switch over penalty for > 128 bit vectors, same can be done for 128 bit vectors also, you may use vpshufd for integers. src/hotspot/cpu/x86/x86.ad line 8649: > 8647: __ vpermd($dst$$XMMRegister, $shuffle$$XMMRegister, $src$$XMMRegister, vlen_enc); > 8648: } else { > 8649: __ vpermps($dst$$XMMRegister, $shuffle$$XMMRegister, $src$$XMMRegister, vlen_enc); Please move this to a macro assembly routine. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13042#discussion_r1146449816 PR Review Comment: https://git.openjdk.org/jdk/pull/13042#discussion_r1146477783 PR Review Comment: https://git.openjdk.org/jdk/pull/13042#discussion_r1146497128 From kvn at openjdk.org Thu Mar 23 17:55:15 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 23 Mar 2023 17:55:15 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v2] In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 20:22:22 GMT, Quan Anh Mai wrote: >> Hi, >> >> This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. 
With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. >> >> Please take a look and leave some reviews. >> Thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into rearrangeI > - improve rearrangeI My testing tier1-3 (includes incubator/vector) and Xcomp passed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13042#issuecomment-1481635377 From qamai at openjdk.org Thu Mar 23 18:35:20 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 23 Mar 2023 18:35:20 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v3] In-Reply-To: References: Message-ID: > Hi, > > This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. > > Please take a look and leave some reviews. > Thanks a lot. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: refine asserts, move logic to C2_MacroAssembler ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13042/files - new: https://git.openjdk.org/jdk/pull/13042/files/38f3450a..fb779477 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13042&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13042&range=01-02 Stats: 28 lines in 4 files changed: 17 ins; 5 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/13042.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13042/head:pull/13042 PR: https://git.openjdk.org/jdk/pull/13042 From qamai at openjdk.org Thu Mar 23 18:35:25 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 23 Mar 2023 18:35:25 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v2] In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 17:52:01 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into rearrangeI >> - improve rearrangeI > > My testing tier1-3 (includes incubator/vector) and Xcomp passed. @vnkozlov Thanks for your testing @jatin-bhateja Thanks a lot for your reviews, I have addressed those comments. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13042#issuecomment-1481703309 From qamai at openjdk.org Thu Mar 23 18:35:29 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 23 Mar 2023 18:35:29 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v2] In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 16:21:03 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into rearrangeI >> - improve rearrangeI > > src/hotspot/cpu/x86/assembler_x86.cpp line 4230: > >> 4228: >> 4229: void Assembler::vpermps(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { >> 4230: assert(vector_len <= AVX_256bit ? VM_Version::supports_avx2() : VM_Version::supports_evex(), ""); > > Instruction support 256 and 512 bit vectors only and assertion should comply with that. I have fixed that > src/hotspot/cpu/x86/x86.ad line 8645: > >> 8643: int vlen_enc = vector_length_encoding(this); >> 8644: if (vlen_enc == Assembler::AVX_128bit) { >> 8645: __ vpermilps($dst$$XMMRegister, $src$$XMMRegister, $shuffle$$XMMRegister, vlen_enc); > > Since you are emitting different instruction to save on domain switch over penalty for > 128 bit vectors, same can be done for 128 bit vectors also, you may use vpshufd for integers. I think `vpshufd` does not support variable indices so we can only use `vpermilps` here > src/hotspot/cpu/x86/x86.ad line 8649: > >> 8647: __ vpermd($dst$$XMMRegister, $shuffle$$XMMRegister, $src$$XMMRegister, vlen_enc); >> 8648: } else { >> 8649: __ vpermps($dst$$XMMRegister, $shuffle$$XMMRegister, $src$$XMMRegister, vlen_enc); > > Please move this to a macro assembly routine. Done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13042#discussion_r1146653117 PR Review Comment: https://git.openjdk.org/jdk/pull/13042#discussion_r1146653363 PR Review Comment: https://git.openjdk.org/jdk/pull/13042#discussion_r1146653904 From dlong at openjdk.org Thu Mar 23 23:30:30 2023 From: dlong at openjdk.org (Dean Long) Date: Thu, 23 Mar 2023 23:30:30 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> Message-ID: On Wed, 22 Mar 2023 00:04:45 GMT, Justin King wrote: >> src/hotspot/share/compiler/directivesParser.cpp line 351: >> >>> 349: } >>> 350: >>> 351: FREE_C_HEAP_ARRAY(char, s); >> >> This looks unsafe. We shouldn't free the memory without clearing all references to it, otherwise there is a dangling pointer. There is already another reference to the memory because of this call: >> >> `(set->*test)((void *)&s);` (see the set_function_definition macro) >> >> I think it would be better to move this copying call until after validation has been done. > > I am very confused actually, `(set->*test)((void *)&s);` calls DirectiveSet::set_X. Looking at the code, it simply just stores the pointer? Is the DirectiveSet supposed to own the option? And if so, who is freeing it then? It doesn't look like cloning actually clones the underlying storage and the DirectiveSet destructor doesn't free it. > > So really DirectiveSet::~DirectiveSet should be freeing the string storage and DirectiveSet::set_X is taking ownership. Yeah? Right, the ownership of the string seems unclear. If we try to free the string storage in ~DirectiveSet I think it will result in a double free, because clone() appears to do a shallow copy. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1146973287 From dlong at openjdk.org Thu Mar 23 23:41:28 2023 From: dlong at openjdk.org (Dean Long) Date: Thu, 23 Mar 2023 23:41:28 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> Message-ID: On Thu, 23 Mar 2023 23:27:52 GMT, Dean Long wrote: >> I am very confused actually, `(set->*test)((void *)&s);` calls DirectiveSet::set_X. Looking at the code, it simply just stores the pointer? Is the DirectiveSet supposed to own the option? And if so, who is freeing it then? It doesn't look like cloning actually clones the underlying storage and the DirectiveSet destructor doesn't free it. >> >> So really DirectiveSet::~DirectiveSet should be freeing the string storage and DirectiveSet::set_X is taking ownership. Yeah? > > Right, the ownership of the string seems unclear. If we try to free the string storage in ~DirectiveSet I think it will result in a double free, because clone() appears to do a shallow copy. Worse yet, it looks like the default value can point to the existing storage of the DisableIntrinsic and ControlIntrinsic global values, making it doubly hard to know if the storage can be freed, unless we get rid of simple = assignment and always allocate new memory when assigning the string values. But why re-invent the wheel? Can't we use something like C++ std::string to manage the lifetimes for us? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1146978578 From dlong at openjdk.org Fri Mar 24 00:06:30 2023 From: dlong at openjdk.org (Dean Long) Date: Fri, 24 Mar 2023 00:06:30 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> Message-ID: On Thu, 23 Mar 2023 23:39:10 GMT, Dean Long wrote: >> Right, the ownership of the string seems unclear. If we try to free the string storage in ~DirectiveSet I think it will result in a double free, because clone() appears to do a shallow copy. > > Worse yet, it looks like the default value can point to the existing storage of the DisableIntrinsic and ControlIntrinsic global values, making it doubly hard to know if the storage can be freed, unless we get rid of simple = assignment and always allocate new memory when assigning the string values. > But why re-invent the wheel? Can't we use something like C++ std::string to manage the lifetimes for us? Or maybe ~DirectiveSet could use the _modified array to decide if it should free the string flags. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1146989162 From jbhateja at openjdk.org Fri Mar 24 04:18:30 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 24 Mar 2023 04:18:30 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v2] In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 18:29:58 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/x86.ad line 8645: >> >>> 8643: int vlen_enc = vector_length_encoding(this); >>> 8644: if (vlen_enc == Assembler::AVX_128bit) { >>> 8645: __ vpermilps($dst$$XMMRegister, $src$$XMMRegister, $shuffle$$XMMRegister, vlen_enc); >> >> Since you are emitting different instruction to save on domain switch over penalty for > 128 bit vectors, same can be done for 128 bit vectors also, you may use vpshufd for integers. > > I think `vpshufd` does not support variable indices so we can only use `vpermilps` here Correct. Thanks ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13042#discussion_r1147100978 From jbhateja at openjdk.org Fri Mar 24 05:12:29 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 24 Mar 2023 05:12:29 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v3] In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 18:35:20 GMT, Quan Anh Mai wrote: >> Hi, >> >> This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. >> >> Please take a look and leave some reviews. >> Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > refine asserts, move logic to C2_MacroAssembler Marked as reviewed by jbhateja (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13042#pullrequestreview-1356009491 From thartmann at openjdk.org Fri Mar 24 08:57:32 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 24 Mar 2023 08:57:32 GMT Subject: RFR: 8303508: Vector.lane() gets wrong value on x86 In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 00:37:29 GMT, Jatin Bhateja wrote: > Incorrectness happens because compiler absorbs casting IR chain during idealizations. > Sign extending the result of sub-word extraction operation. > > Please refer to detailed discussion on https://github.com/openjdk/jdk/pull/13070#issuecomment-1479779537 > > Best Regards, > Jatin All tests passed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13152#issuecomment-1482455461 From yyang at openjdk.org Fri Mar 24 10:35:05 2023 From: yyang at openjdk.org (Yi Yang) Date: Fri, 24 Mar 2023 10:35:05 GMT Subject: RFR: 8304034: Remove redundant and meaningless comments in opto [v7] In-Reply-To: References: Message-ID: <9hpJHzt3NBO2BWI0RVPNU2WxrEz4WTpCOrQaPN_yM34=.6e8afcad-6f29-4464-b33a-62e66749afdc@github.com> > Please help review this trivial change to remove redundant and meaningless comments in `hotspot/share/opto` directory. > > They are either > 1. Repeat the function name that the function they comment for. > 2. Makes no sense, e.g. `//----Idealize----` > > And I think original CC-style code (`if( test )`,`call( arg )`) can be formatted in one go, instead of formatting the near code when someone touches them. 
But this may form a big patch, and it confuses code blame, so I left this work until we reach a consensus. > > Thanks! Yi Yang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: - Merge branch 'master' into cleanupc2 - review feedback - Merge branch 'master' into cleanupc2 - restore mistakenly removed lines - cleanup more - reserve some comments - multiple empty lines to one empty lines - reserve some comments - 8304034: Remove redundant and meaningless comments in opto ------------- Changes: https://git.openjdk.org/jdk/pull/12995/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12995&range=06 Stats: 2759 lines in 117 files changed: 10 ins; 2742 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/12995.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12995/head:pull/12995 PR: https://git.openjdk.org/jdk/pull/12995 From rcastanedalo at openjdk.org Fri Mar 24 10:51:36 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 24 Mar 2023 10:51:36 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 10:13:12 GMT, Emanuel Peter wrote: >> Thanks for the suggestion! I prefer to keep `LoopMaxUnroll` as a bound, to guard against pathologically large basic blocks, even if it implies missing some hand-unrolled loops. > > Up to you. The loop body size is limited by `LoopUnrollLimit`, which is currently set to 50 or 60, depending on the platform. So we do not unroll too much to prevent pathologically large loop bodies. That is actually why most SuperWord tests have to increase `LoopUnrollLimit` - otherwise we would not even vectorize those cases. Good point. I will test using the loop body size as a bound for superword reduction analysis, but leave the tighter `LoopMaxUnroll` bound for selecting floating-point min/max implementations, which is more time sensitive. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1147413537 From jbhateja at openjdk.org Fri Mar 24 11:22:58 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 24 Mar 2023 11:22:58 GMT Subject: Integrated: 8303508: Vector.lane() gets wrong value on x86 In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 00:37:29 GMT, Jatin Bhateja wrote: > Incorrectness happens because compiler absorbs casting IR chain during idealizations. > Sign extending the result of sub-word extraction operation. > > Please refer to detailed discussion on https://github.com/openjdk/jdk/pull/13070#issuecomment-1479779537 > > Best Regards, > Jatin This pull request has now been integrated. Changeset: d61de141 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/d61de141eb8ba52122db43172429f9186ea47e61 Stats: 91 lines in 3 files changed: 87 ins; 3 del; 1 mod 8303508: Vector.lane() gets wrong value on x86 Reviewed-by: eliu, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13152 From rcastanedalo at openjdk.org Fri Mar 24 15:21:45 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 24 Mar 2023 15:21:45 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2] In-Reply-To: References: Message-ID: > Reduction analysis finds cycles of reduction operations within loops. 
The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). > > This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: > > ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) > > The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. > > ## Performance Benefits > > As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. > > ### Increased Auto-Vectorization Scope > > There are two main scenarios in which the proposed changeset enables further auto-vectorization: > > #### Reductions Using Global Accumulators > > > public class Foo { > int acc = 0; > (..) > void reduce(int[] array) { > for (int i = 0; i < array.length; i++) { > acc += array[i]; > } > } > } > > Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: > > ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) > > #### Reductions of partially unrolled loops > > > (..) > for (int i = 0; i < array.length / 2; i++) { > acc += array[2*i]; > acc += array[2*i + 1]; > } > (..) > > > These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. > > ### Increased Performance of x64 Floating-Point `Math.min()/max()` > > Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). > > ## Implementation details > > The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. 
To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). > > A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. > > The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. > > ## Testing > > ### Functionality > > - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). > - fuzzing (12 h. on linux-x64 and linux-aarch64). > > ##### TestGeneralizedReductions.java > > Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. > > ##### TestFpMinMaxReductions.java > > Tests the matching of floating-point max/min implementations in x64. > > ##### TestSuperwordFailsUnrolling.java > > This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. > > ### Performance > > #### General Benchmarks > > The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. > > #### Micro-benchmarks > > The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). > > > ##### VectorReduction.java > > These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. 
Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. > > ##### MaxIntrinsics.java > > This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): > > | micro-benchmark | speedup compared to mainline | > | --- | --- | > | `fMinReduceInOuterLoop` | 1.1x | > | `fMinReduceNonCounted` | 2.3x | > | `fMinReduceGlobalAccumulator` | 2.4x | > | `fMinReducePartiallyUnrolled` | 3.9x | > > ## Acknowledgments > > Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 28 additional commits since the last revision: - Merge master - Relax the reduction cycle search bound - Remove redundant IR check precondition - Use SuperWord members in reduction marking - Remove redundant opcode checks - Do not run test in x86-32 - Update existing test instead of removing it - Add negative vectorization test - Update copyright headers - Add two more reduction vectorization microbenchmarks - ... and 18 more: https://git.openjdk.org/jdk/compare/d063f7e5...95f6cc33 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13120/files - new: https://git.openjdk.org/jdk/pull/13120/files/72fe5a6a..95f6cc33 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13120&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13120&range=00-01 Stats: 14696 lines in 610 files changed: 8065 ins; 3518 del; 3113 mod Patch: https://git.openjdk.org/jdk/pull/13120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13120/head:pull/13120 PR: https://git.openjdk.org/jdk/pull/13120 From rcastanedalo at openjdk.org Fri Mar 24 15:21:48 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 24 Mar 2023 15:21:48 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: References: Message-ID: <9K3JsxtFNTYymj86XUBOvnuQZd4fziGM7PkyPh_JefM=.17668ef2-cf1e-4145-9b02-308f9f123175@github.com> On Tue, 21 Mar 2023 14:49:26 GMT, Roberto Casta?eda Lozano wrote: > Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. 
Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). > > This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: > > ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) > > The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. > > ## Performance Benefits > > As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. > > ### Increased Auto-Vectorization Scope > > There are two main scenarios in which the proposed changeset enables further auto-vectorization: > > #### Reductions Using Global Accumulators > > > public class Foo { > int acc = 0; > (..) > void reduce(int[] array) { > for (int i = 0; i < array.length; i++) { > acc += array[i]; > } > } > } > > Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: > > ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) > > #### Reductions of partially unrolled loops > > > (..) > for (int i = 0; i < array.length / 2; i++) { > acc += array[2*i]; > acc += array[2*i + 1]; > } > (..) > > > These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. > > ### Increased Performance of x64 Floating-Point `Math.min()/max()` > > Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). > > ## Implementation details > > The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. 
Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). > > A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. > > The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. > > ## Testing > > ### Functionality > > - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). > - fuzzing (12 h. on linux-x64 and linux-aarch64). > > ##### TestGeneralizedReductions.java > > Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. > > ##### TestFpMinMaxReductions.java > > Tests the matching of floating-point max/min implementations in x64. > > ##### TestSuperwordFailsUnrolling.java > > This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. > > ### Performance > > #### General Benchmarks > > The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. > > #### Micro-benchmarks > > The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). > > > ##### VectorReduction.java > > These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. > > ##### MaxIntrinsics.java > > This file is extended with four new micro-benchmarks. 
Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): > > | micro-benchmark | speedup compared to mainline | > | --- | --- | > | `fMinReduceInOuterLoop` | 1.1x | > | `fMinReduceNonCounted` | 2.3x | > | `fMinReduceGlobalAccumulator` | 2.4x | > | `fMinReducePartiallyUnrolled` | 3.9x | > > ## Acknowledgments > > Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. I just pushed a new version addressing @eme64's comments and suggestions, please review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1482978249 From mdoerr at openjdk.org Fri Mar 24 15:35:43 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 24 Mar 2023 15:35:43 GMT Subject: RFR: 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC Message-ID: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> I suggest to remove this code for the following reasons: - It doesn't work with ZGC (oop needs to go through load barrier, see JBS issue). - It generates too much code. Loading oops is quite common and the oop verification code is quite lengthy. - Other platforms don't have it, either. ------------- Commit messages: - 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC Changes: https://git.openjdk.org/jdk/pull/13175/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13175&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304880 Stats: 8 lines in 1 file changed: 0 ins; 8 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13175.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13175/head:pull/13175 PR: https://git.openjdk.org/jdk/pull/13175 From duke at openjdk.org Fri Mar 24 15:55:17 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Fri, 24 Mar 2023 15:55:17 GMT Subject: RFR: 8304445: Remaining uses of NULL in ciInstanceKlass.cpp Message-ID: 8304445: Remaining uses of NULL in ciInstanceKlass.cpp ------------- Commit messages: - 8304445: Replace remaining uses of NULL in ciInstanceKlass.cpp Changes: https://git.openjdk.org/jdk/pull/13178/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13178&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304445 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13178.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13178/head:pull/13178 PR: https://git.openjdk.org/jdk/pull/13178 From shade at openjdk.org Fri Mar 24 16:14:24 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 24 Mar 2023 16:14:24 GMT Subject: RFR: 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC In-Reply-To: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> References: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> Message-ID: On Fri, 24 Mar 2023 15:27:53 GMT, Martin Doerr wrote: > I suggest to remove this code for the following reasons: > - It doesn't work with ZGC (oop needs to go through load barrier, see JBS issue). > - It generates too much code. 
Loading oops is quite common and the oop verification code is quite lengthy. > - Other platforms don't have it, either. So this `check_oop` happens too early after the load, right? There are also `VerifyOops` blocks for stores in the same file, would you like to handle those as well? These oops should be okay with most GCs, but the point about the code size stands. Other platforms do not have the checks on those paths too, AFAICS. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13175#issuecomment-1483061365 From kvn at openjdk.org Fri Mar 24 16:38:02 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 24 Mar 2023 16:38:02 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: On Mon, 20 Mar 2023 19:23:34 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. I also run tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. 
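To picture what an "allocation merge" looks like at the source level, a minimal sketch is below; the class and field names are made up for illustration and are not taken from the PR's tests:

```java
// Sketch: two non-escaping allocations merged by a Phi. The conditional picks
// one of the two Point objects, and the merge is then used only as the base
// of a field load (and as debug information at safepoints). Splitting the
// load through the Phi and rematerializing the objects at deoptimization is
// what allows both allocations to be scalar replaced.
class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

class AllocationMergeSketch {
    static int pick(boolean cond) {
        Point p = cond ? new Point(1, 2) : new Point(3, 4); // allocation merge
        return p.x;                                         // field load through the merge
    }
}
```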
> > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Add support for SR'ing some inputs of merges used for field loads Thank you, @JohnTortugo, for continuing to work on it. I will test it and do a proper review later. Of course @iwanowww has to approve it too :) ------------- PR Review: https://git.openjdk.org/jdk/pull/12897#pullrequestreview-1357067077 From mdoerr at openjdk.org Fri Mar 24 16:38:53 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 24 Mar 2023 16:38:53 GMT Subject: RFR: 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC In-Reply-To: References: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> Message-ID: On Fri, 24 Mar 2023 16:11:37 GMT, Aleksey Shipilev wrote: > So this `check_oop` happens too early after the load, right? Correct, it is placed between the raw load and the load barrier. > There are also `VerifyOops` blocks for stores in the same file, would you like to handle those as well? These oops should be okay with most GCs, but the point about the code size stands. Other platforms do not have the checks on those paths too, AFAICS. The store cases are ok with all GCs because we only write valid oops into the heap. Stores seem to be less frequent than loads. So, the impact on size is smaller. But we could discuss the removal. Other platforms don't have them. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13175#issuecomment-1483093585 From kvn at openjdk.org Fri Mar 24 16:43:13 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 24 Mar 2023 16:43:13 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: On Mon, 20 Mar 2023 19:23:34 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward.
It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. I also run tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Add support for SR'ing some inputs of merges used for field loads You new test failed in GHA testing with 32-bit VM: `Could not find VM flag "UseCompressedOops" in @IR rule 1 at int`. You need to adjust next rule: `@IR(counts = { IRNode.ALLOC, "2" }, applyIf = { "UseCompressedOops", "false" })` ------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1483098691 From shade at openjdk.org Fri Mar 24 16:55:17 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 24 Mar 2023 16:55:17 GMT Subject: RFR: 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC In-Reply-To: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> References: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> Message-ID: On Fri, 24 Mar 2023 15:27:53 GMT, Martin Doerr wrote: > I suggest to remove this code for the following reasons: > - It doesn't work with ZGC (oop needs to go through load barrier, see JBS issue). > - It generates too much code. Loading oops is quite common and the oop verification code is quite lengthy. > - Other platforms don't have it, either. Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13175#pullrequestreview-1357094301 From shade at openjdk.org Fri Mar 24 16:55:21 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 24 Mar 2023 16:55:21 GMT Subject: RFR: 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC In-Reply-To: References: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> Message-ID: On Fri, 24 Mar 2023 16:36:10 GMT, Martin Doerr wrote: > The store cases are ok with all GCs because we only write valid oops into the heap. Stores seem to be a less frequent than loads. So, the impact on size is smaller. But we could discuss the removal. Other platforms don't have them. I see no reason to leave them. Your call. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13175#issuecomment-1483114528 From mdoerr at openjdk.org Fri Mar 24 17:14:41 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 24 Mar 2023 17:14:41 GMT Subject: RFR: 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC In-Reply-To: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> References: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> Message-ID: On Fri, 24 Mar 2023 15:27:53 GMT, Martin Doerr wrote: > I suggest to remove this code for the following reasons: > - It doesn't work with ZGC (oop needs to go through load barrier, see JBS issue). > - It generates too much code. Loading oops is quite common and the oop verification code is quite lengthy. > - Other platforms don't have it, either. Thanks for the review! Hmm, it may be beneficial in some bug chasing scenario if we can ensure that C1 doesn't write any broken oops. I think I had implemented the checking code for that a long time ago. On the other side, we didn't need it for quite some time. Maybe someone else has an opinion about it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13175#issuecomment-1483143777 From jcking at openjdk.org Fri Mar 24 19:30:29 2023 From: jcking at openjdk.org (Justin King) Date: Fri, 24 Mar 2023 19:30:29 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> Message-ID: On Fri, 24 Mar 2023 00:03:21 GMT, Dean Long wrote: >> Worse yet, it looks like the default value can point to the existing storage of the DisableIntrinsic and ControlIntrinsic global values, making it doubly hard to know if the storage can be freed, unless we get rid of simple = assignment and always allocate new memory when assigning the string values. >> But why re-invent the wheel? Can't we use something like C++ std::string to manage the lifetimes for us? > > Or maybe ~DirectiveSet could use the _modified array to decide if it should free the string flags. `DirectiveSet::compilecommand_compatibility_init` makes things even more complicated, because it doesnt update `_modified` and it should probably be making a copy of the string provided via `CompilerOracle::has_option_value`. Ugh... ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1147979655 From kvn at openjdk.org Fri Mar 24 19:46:36 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 24 Mar 2023 19:46:36 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: On Mon, 20 Mar 2023 19:23:34 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? 
I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. I also run tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Add support for SR'ing some inputs of merges used for field loads Initial review. src/hotspot/share/code/debugInfo.hpp line 199: > 197: // ObjectValue describing an object that was scalar replaced. > 198: > 199: class ObjectMergeValue: public ScopeValue { Why you did not make subclass of ObjectValue? You would need to check `sv->is_object_merge()` first before `sv->is_object()` in few places. But on other hand you don't need to duplicates ObjectValue`s fields and asserts. src/hotspot/share/opto/callnode.cpp line 1479: > 1477: #ifdef ASSERT > 1478: _alloc(alloc), > 1479: #endif May be we should always pass alloc, even in product VM. It is not related to your changes but it is pain to have. src/hotspot/share/opto/callnode.hpp line 511: > 509: // by a SafePoint; 2) A scalar replaced object is participating in an allocation > 510: // merge (Phi) and the Phi is referenced by a SafePoint. The schematics of how > 511: // 'spobj' is used in both scenarios are described below. I am not comfortable with reusing SafePointScalarObjectNode for 2) since it describes totally different information. I think it should be separate Node which points to array of SFSO id (in addition to Phis) similar how we do now if SFSO is referenced in other SFSO's field. SFSO could be created before the merge. Consider: Point p = new Point(); Point q = foo(); if (cond) { q = p; } trap(p, q); src/hotspot/share/opto/callnode.hpp line 519: > 517: // _nfields : how many fields the SR object has. 
> 518: // _alloc : pointer to the Allocate object that previously created the SR object. > 519: // Only used for debug purposes. May be useful in other cases too in a future not only in debug. src/hotspot/share/opto/macro.hpp line 196: > 194: Node* size_in_bytes); > 195: > 196: static Node* make_arraycopy_load(Compile* comp, PhaseIterGVN* igvn, ArrayCopyNode* ac, intptr_t offset, Node* ctl, Node* mem, BasicType ft, const Type *ftype, AllocateNode *alloc); Why you need this change? It polluted diffs and hide important changes. Could be separate change from this one. ------------- PR Review: https://git.openjdk.org/jdk/pull/12897#pullrequestreview-1357285068 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1147961058 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1147963487 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1147991641 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1147965112 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1147973203 From jcking at openjdk.org Fri Mar 24 19:47:31 2023 From: jcking at openjdk.org (Justin King) Date: Fri, 24 Mar 2023 19:47:31 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> Message-ID: <9GW5pDTdap2DBR3yBsHs4TBw1S3SygQ7SUH276-VVzc=.289cf892-3c04-40d8-a898-b63d05a1528a@github.com> On Fri, 24 Mar 2023 19:27:19 GMT, Justin King wrote: >> Or maybe ~DirectiveSet could use the _modified array to decide if it should free the string flags. > > `DirectiveSet::compilecommand_compatibility_init` makes things even more complicated, because it doesnt update `_modified` and it should probably be making a copy of the string provided via `CompilerOracle::has_option_value`. Ugh... Nevermind, it looks like `CompilerOracle::has_option_value` has static duration effectively. So I should be able to fix this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1147992934 From jcking at openjdk.org Fri Mar 24 19:51:29 2023 From: jcking at openjdk.org (Justin King) Date: Fri, 24 Mar 2023 19:51:29 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> Message-ID: On Tue, 21 Mar 2023 20:34:17 GMT, Justin King wrote: >> Add missing `FREE_C_HEAP_ARRAY` call. > > Justin King has updated the pull request incrementally with one additional commit since the last revision: > > Update based on review > > Signed-off-by: Justin King Moving this to draft while I deal with fixing the ownership issue. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13125#issuecomment-1483330003 From xliu at openjdk.org Fri Mar 24 19:56:32 2023 From: xliu at openjdk.org (Xin Liu) Date: Fri, 24 Mar 2023 19:56:32 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: On Mon, 20 Mar 2023 19:23:34 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. I also run tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Add support for SR'ing some inputs of merges used for field loads src/hotspot/share/opto/callnode.hpp line 614: > 612: int merge_pointer_idx(JVMState* jvms) const { > 613: assert(jvms != nullptr, "JVMS reference is null."); > 614: return jvms->scloff() + _merge_pointer_idx; how about we also assert is_from_merge() here? Your comment above says that _merge_point_idx is a zero-based index of sfpt's input array. here we use scloff-based. 
I think either is okay, but we need consistency. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1148000014 From sviswanathan at openjdk.org Fri Mar 24 21:32:30 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 24 Mar 2023 21:32:30 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v3] In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 18:35:20 GMT, Quan Anh Mai wrote: >> Hi, >> >> This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. >> >> Please take a look and leave some reviews. >> Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > refine asserts, move logic to C2_MacroAssembler Marked as reviewed by sviswanathan (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13042#pullrequestreview-1357461225 From cslucas at openjdk.org Fri Mar 24 23:29:29 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 24 Mar 2023 23:29:29 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: <0UbMqMHtVIayPdJMmfDF6YTadWe4YTlSW6mZc5P3IU8=.c4b1a292-e434-4c57-a5cd-015edca2ec95@github.com> On Fri, 24 Mar 2023 19:06:18 GMT, Vladimir Kozlov wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Add support for SR'ing some inputs of merges used for field loads > > src/hotspot/share/opto/callnode.cpp line 1479: > >> 1477: #ifdef ASSERT >> 1478: _alloc(alloc), >> 1479: #endif > > May be we should always pass alloc, even in product VM. It is not related to your changes but it is pain to have. I can make that change. > src/hotspot/share/opto/macro.hpp line 196: > >> 194: Node* size_in_bytes); >> 195: >> 196: static Node* make_arraycopy_load(Compile* comp, PhaseIterGVN* igvn, ArrayCopyNode* ac, intptr_t offset, Node* ctl, Node* mem, BasicType ft, const Type *ftype, AllocateNode *alloc); > > Why you need this change? It polluted diffs and hide important changes. Could be separate change from this one. I had to make this method static because it uses `value_from_mem` - which I also made static. I had to make `value_from_mem` static so that I can use it outside PhaseMacroExpand. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1148141099 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1148140607 From kvn at openjdk.org Fri Mar 24 23:40:31 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 24 Mar 2023 23:40:31 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: <0UbMqMHtVIayPdJMmfDF6YTadWe4YTlSW6mZc5P3IU8=.c4b1a292-e434-4c57-a5cd-015edca2ec95@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> <0UbMqMHtVIayPdJMmfDF6YTadWe4YTlSW6mZc5P3IU8=.c4b1a292-e434-4c57-a5cd-015edca2ec95@github.com> Message-ID: On Fri, 24 Mar 2023 23:24:47 GMT, Cesar Soares Lucas wrote: >> src/hotspot/share/opto/macro.hpp line 196: >> >>> 194: Node* size_in_bytes); >>> 195: >>> 196: static Node* make_arraycopy_load(Compile* comp, PhaseIterGVN* igvn, ArrayCopyNode* ac, intptr_t offset, Node* ctl, Node* mem, BasicType ft, const Type *ftype, AllocateNode *alloc); >> >> Why you need this change? It polluted diffs and hide important changes. Could be separate change from this one. > > I had to make this method static because it uses `value_from_mem` - which I also made static. I had to make `value_from_mem` static so that I can use it outside PhaseMacroExpand. I see, you use it in escape.cpp. Okay. I need to review changes there too. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1148144963 From qamai at openjdk.org Fri Mar 24 23:42:30 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 24 Mar 2023 23:42:30 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v3] In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 21:29:32 GMT, Sandhya Viswanathan wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> refine asserts, move logic to C2_MacroAssembler > > Marked as reviewed by sviswanathan (Reviewer). @sviswa7 Thanks for your approval. May I ask if I can integrate the change? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13042#issuecomment-1483593300 From cslucas at openjdk.org Fri Mar 24 23:49:31 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 24 Mar 2023 23:49:31 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: <9n3UqJDruE0pvA51cFuUGal2gvluNX5LoseLuhvXlIg=.ab6692bd-3aba-4d42-a925-a15ff906677d@github.com> On Fri, 24 Mar 2023 19:53:52 GMT, Xin Liu wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Add support for SR'ing some inputs of merges used for field loads > > src/hotspot/share/opto/callnode.hpp line 614: > >> 612: int merge_pointer_idx(JVMState* jvms) const { >> 613: assert(jvms != nullptr, "JVMS reference is null."); >> 614: return jvms->scloff() + _merge_pointer_idx; > > how about we also assert is_from_merge() here? > > Your comment above says that _merge_point_idx is a zero-based index of sfpt's input array. 
> here we use scloff-based. I think either is okay, but we need consistency. I think adding assert is a good idea. I'll fix the comment. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1148148247 From cslucas at openjdk.org Fri Mar 24 23:59:30 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 24 Mar 2023 23:59:30 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: <7xRwVRVapKbqiVQMDMZUh3ILhfaYub_brXWVopFhJ8M=.28289c04-0ff0-4f19-b764-03af4d3155d6@github.com> On Fri, 24 Mar 2023 19:02:57 GMT, Vladimir Kozlov wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Add support for SR'ing some inputs of merges used for field loads > > src/hotspot/share/code/debugInfo.hpp line 199: > >> 197: // ObjectValue describing an object that was scalar replaced. >> 198: >> 199: class ObjectMergeValue: public ScopeValue { > > Why you did not make subclass of ObjectValue? You would need to check `sv->is_object_merge()` first before `sv->is_object()` in few places. But on other hand you don't need to duplicates ObjectValue`s fields and asserts. Let me try that and see how it looks. > src/hotspot/share/opto/callnode.hpp line 511: > >> 509: // by a SafePoint; 2) A scalar replaced object is participating in an allocation >> 510: // merge (Phi) and the Phi is referenced by a SafePoint. The schematics of how >> 511: // 'spobj' is used in both scenarios are described below. > > I am not comfortable with reusing SafePointScalarObjectNode for 2) since it describes totally different information. > I think it should be separate Node which points to array of SFSO id (in addition to Phis) similar how we do now if SFSO is referenced in other SFSO's field. SFSO could be created before the merge. Consider: > > Point p = new Point(); > Point q = foo(); > if (cond) { > q = p; > } > trap(p, q); I had considered that but decided not to do it to prevent adding a new IR node. I'll give that a shot and update this thread with how it goes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1148150933 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1148151474 From kvn at openjdk.org Sat Mar 25 00:06:30 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 25 Mar 2023 00:06:30 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v3] In-Reply-To: References: Message-ID: <9bj74hBLtWlnwPit9KPzTPZhsDfv_LzlP50KFi3GXNc=.a481a458-28dd-46e0-b0fa-49d6f668a242@github.com> On Fri, 24 Mar 2023 23:39:31 GMT, Quan Anh Mai wrote: > May I ask if I can integrate the change? I am running testing for latest version. I will let you know when it is done. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13042#issuecomment-1483611249 From kvn at openjdk.org Sat Mar 25 00:11:31 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 25 Mar 2023 00:11:31 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: <7xRwVRVapKbqiVQMDMZUh3ILhfaYub_brXWVopFhJ8M=.28289c04-0ff0-4f19-b764-03af4d3155d6@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> <7xRwVRVapKbqiVQMDMZUh3ILhfaYub_brXWVopFhJ8M=.28289c04-0ff0-4f19-b764-03af4d3155d6@github.com> Message-ID: On Fri, 24 Mar 2023 23:57:07 GMT, Cesar Soares Lucas wrote: >> src/hotspot/share/opto/callnode.hpp line 511: >> >>> 509: // by a SafePoint; 2) A scalar replaced object is participating in an allocation >>> 510: // merge (Phi) and the Phi is referenced by a SafePoint. The schematics of how >>> 511: // 'spobj' is used in both scenarios are described below. >> >> I am not comfortable with reusing SafePointScalarObjectNode for 2) since it describes totally different information. >> I think it should be separate Node which points to array of SFSO id (in addition to Phis) similar how we do now if SFSO is referenced in other SFSO's field. SFSO could be created before the merge. Consider: >> >> Point p = new Point(); >> Point q = foo(); >> if (cond) { >> q = p; >> } >> trap(p, q); > > I had considered that but decided not to do it to prevent adding a new IR node. I'll give that a shot and update this thread with how it goes. It **will** complicate your DebugInfo code (packing/unpacking) information. But I think it is right thing to do to avoid duplicated re-allocations during deoptimization - you should have only one new object. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1148154060 From cslucas at openjdk.org Sat Mar 25 00:11:34 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Sat, 25 Mar 2023 00:11:34 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: On Mon, 20 Mar 2023 19:23:34 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? 
I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. I also run tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Add support for SR'ing some inputs of merges used for field loads src/hotspot/share/opto/escape.cpp line 3734: > 3732: if (reducible_merges.member(n)) { > 3733: // Split loads through phi > 3734: reduce_this_phi_on_field_access(n->as_Phi(), alloc_worklist); I decided to do the split here so that Phase 1 of `split_unique_types` could assign a new instance type for the new loads created by `split_through_phi`. However, I'm considering doing the split at the end of `compute_escape` instead of here to keep all the code that does split close together. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1148153569 From kvn at openjdk.org Sat Mar 25 00:27:30 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 25 Mar 2023 00:27:30 GMT Subject: RFR: 8304258: x86: Improve the code generation of VectorRearrange with int and float [v3] In-Reply-To: References: Message-ID: <7aqVf31ZCC0eUKDQwtWGtJvGugGSkopdITD_HEYl7wY=.03de436e-28bb-4d42-bae8-a9e12a4ea311@github.com> On Thu, 23 Mar 2023 18:35:20 GMT, Quan Anh Mai wrote: >> Hi, >> >> This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. >> >> Please take a look and leave some reviews. >> Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > refine asserts, move logic to C2_MacroAssembler My testing finished clean. @merykitty, you can integrate. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13042#pullrequestreview-1357598956 From qamai at openjdk.org Sat Mar 25 05:33:48 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 25 Mar 2023 05:33:48 GMT Subject: Integrated: 8304258: x86: Improve the code generation of VectorRearrange with int and float In-Reply-To: References: Message-ID: On Wed, 15 Mar 2023 12:55:52 GMT, Quan Anh Mai wrote: > Hi, > > This small patch changes the code generation of VectorRearrangeNode with respect to int and float elements. With not-larger-than-128-bit vectors, we can use `vpermilps` instead of promoting to operating on the extended 256-bit vector. This also helps the code generation of AVX1 to not rely on the sse version. > > Please take a look and leave some reviews. > Thanks a lot. This pull request has now been integrated. Changeset: 38e17148 Author: Quan Anh Mai URL: https://git.openjdk.org/jdk/commit/38e17148faef7799515478bd834ed2fa1a5153de Stats: 46 lines in 5 files changed: 34 ins; 2 del; 10 mod 8304258: x86: Improve the code generation of VectorRearrange with int and float Reviewed-by: kvn, jbhateja, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/13042 From qamai at openjdk.org Mon Mar 27 02:46:41 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 27 Mar 2023 02:46:41 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v14] In-Reply-To: References: Message-ID: <1VFNGzcsSt_ePrW28BGgxiPB94ehdewZrNH0NqiWM98=.f74660f7-c22e-4f9a-91c8-a341938baa82@github.com> > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. 
> > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 47 commits: - move asserts to use sites - windows complaints - compiler complaints - undefined internal linkage - add tests, special casing large shift - draft - Merge branch 'master' into unsignedDiv - Merge branch 'master' into unsignedDiv - wip - Merge branch 'master' into unsignedDiv - ... and 37 more: https://git.openjdk.org/jdk/compare/e73411a2...e44625d6 ------------- Changes: https://git.openjdk.org/jdk/pull/9947/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=13 Stats: 2148 lines in 13 files changed: 1738 ins; 300 del; 110 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Mon Mar 27 02:48:43 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 27 Mar 2023 02:48:43 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v13] In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 14:29:27 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. 
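(Aside, not the C2 implementation: a self-contained Java sketch of the multiply-and-shift identity quoted above, floor(x / d) = floor(x * c / 2**m) with c = ceil(2**m / d). The divisor, shift amount, and resulting magic constant are chosen for a 16-bit dividend so that no intermediate product can overflow a signed 64-bit long, sidestepping the overflow cases the patch has to handle; the loop brute-force checks the identity over the whole range.)

```java
// Standalone illustration with invented constants: replace an unsigned division
// by d with one multiplication and one right shift, and verify the result
// against ordinary division for every 16-bit dividend.
public class MagicDivSketch {
    public static void main(String[] args) {
        final int d = 7;                        // example divisor
        final int m = 19;                       // 16 bits + ceil(log2(7))
        final long c = ((1L << m) + d - 1) / d; // ceil(2^19 / 7) = 74899

        for (long x = 0; x < (1L << 16); x++) {
            long viaDiv = x / d;
            long viaMul = (x * c) >> m;         // multiply-and-shift form
            if (viaDiv != viaMul) {
                throw new AssertionError("mismatch at x = " + x);
            }
        }
        System.out.println("x / " + d + " == (x * " + c + ") >> " + m + " for all 16-bit x");
    }
}
```

For full 32-bit or 64-bit dividends the product no longer fits in a primitive type when the magic constant overflows, which is exactly where the quoted description falls back to the high-half multiply and add-back variants.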
> > Quan Anh Mai has updated the pull request incrementally with seven additional commits since the last revision: > > - change julong to uint64_t > - uint > - various fixes > - add constexpr > - add constexpr > - add message to static_assert > - missing powerOfTwo.hpp Sorry for the late updates, I have updated the gtest to emulate the transformation done in `DivNode::Ideal`s and verify the results with numerous input values. This uncover a bug that a shift value too large warps back to 0, which gives incorrect results. As a result, I have added multiple asserts to ensure the shift does not overflow when it should not and idealisations to `0` in the cases it does. Thanks a lot. ------------- PR Comment: https://git.openjdk.org/jdk/pull/9947#issuecomment-1484401515 From rcastanedalo at openjdk.org Mon Mar 27 07:33:34 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 07:33:34 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v6] In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 09:43:26 GMT, Tobias Holenstein wrote: >> In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). >> >> - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile >> - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. >> >> ### Global profile >> Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. >> >> ### Local profile >> Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. >> tabA >> >> When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. >> tabB >> >> ### New profile >> The user can also create a new filter profile and give it a name. E.g. `My Filters` >> newProfile >> >> The `My Filters` profile is then globally available to other tabs as well >> selectProfile >> >> >> ### Filters for cloned tabs >> When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened >> cloneTab >> >> ### Saving of filters and profiles >> When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. > > Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: > > - always select previous profile for new tabs > - .js ending for filters > - save order of filters Thanks for addressing my comments again, Toby! I like the additional functionality and that the changeset preserves the current workflow by default. I just have a question about graph viewing performance and a couple of minor comments. 
src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/FilterChain.java line 71: > 69: } > 70: > 71: public void applyInOrder(Diagram diagram, FilterChain filterOrder) { I have also addressed this in [#12955](https://github.com/openjdk/jdk/pull/12955), but I think your solution of sorting the filter list upfront, rather than every time filters are applied, is preferable. I will wait for this PR to be integrated and then exclude the corresponding changes from #12955. src/utils/IdealGraphVisualizer/FilterWindow/src/main/java/com/sun/hotspot/igv/filterwindow/FilterTopComponent.java line 454: > 452: > 453: String after = (String) fo.getAttribute(AFTER_ID); > 454: System.out.println(displayName + " after " + after); Please remove. src/utils/IdealGraphVisualizer/FilterWindow/src/main/java/com/sun/hotspot/igv/filterwindow/FilterTopComponent.java line 632: > 630: allFiltersOrdered.sortBy(order); > 631: > 632: System.out.println("readExternal"); Please remove. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 326: > 324: > 325: // called when the filter in filterChain changed, but not filterChain itself > 326: private void filterChanged() { After applying this PR, `DiagramViewModel::filterChanged()` is fired every time a new graph in a group is viewed. This is not a functional bug, but it causes the expensive `DiagramViewModel::rebuildDiagram()` to be called twice in that scenario (the other call comes from `DiagramViewModel::changed()`). Would it be possible to arrange the code so that `DiagramViewModel::rebuildDiagram()` is only called once when a new graph in a group is viewed? ------------- Changes requested by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/12714#pullrequestreview-1358464092 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1148879518 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1148874121 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1148874585 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1148885678 From tholenstein at openjdk.org Mon Mar 27 07:53:19 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Mar 2023 07:53:19 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v7] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. 
> tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: remove 2x println ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/e1add0c6..310cccb8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=05-06 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From tholenstein at openjdk.org Mon Mar 27 07:53:24 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Mar 2023 07:53:24 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v6] In-Reply-To: References: Message-ID: On Mon, 27 Mar 2023 07:10:05 GMT, Roberto Casta?eda Lozano wrote: >> Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: >> >> - always select previous profile for new tabs >> - .js ending for filters >> - save order of filters > > src/utils/IdealGraphVisualizer/FilterWindow/src/main/java/com/sun/hotspot/igv/filterwindow/FilterTopComponent.java line 454: > >> 452: >> 453: String after = (String) fo.getAttribute(AFTER_ID); >> 454: System.out.println(displayName + " after " + after); > > Please remove. done > src/utils/IdealGraphVisualizer/FilterWindow/src/main/java/com/sun/hotspot/igv/filterwindow/FilterTopComponent.java line 632: > >> 630: allFiltersOrdered.sortBy(order); >> 631: >> 632: System.out.println("readExternal"); > > Please remove. done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1148916838 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1148917069 From rcastanedalo at openjdk.org Mon Mar 27 08:52:32 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 08:52:32 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: > The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: > > - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and > - "Condense graph", which makes the graph more compact without loss of information. 
> > Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): > > ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) > > Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: > - combining Bool and conversion nodes into their predecessors, > - inlining all Parm nodes except control into their successors (this removes lots of long edges), > - removing "top" inputs from call-like nodes, > - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, > - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and > - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). > > The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: > > ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) > > Note that the exact input indices can still be retrieved via the incoming edge's tooltips: > > ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) > > The control-flow graph view is also adapted to this representation: > > ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) > > #### Additional improvements > > Additionally, this changeset: > - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), > - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, > - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and > - defines and documents JavaScript helpers to simplify the new and existing available filters. > > Here is an example of the effect of the new "Show custom node info" filter: > > ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) > > ### Testing > > #### Functionality > > - Tested the functionality manually on a small selection of graphs. > > - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > #### Performance > > Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. 
However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. > > The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). Roberto Casta?eda Lozano has updated the pull request incrementally with five additional commits since the last revision: - Increase the bold text line factor slightly - Add extra horizontal margin for long labels and let them overflow within the node - Select slots as well - Remove code that is commented out - Assert inputLabel is non-null ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12955/files - new: https://git.openjdk.org/jdk/pull/12955/files/b379e87c..dde38762 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12955&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12955&range=00-01 Stats: 22 lines in 4 files changed: 8 ins; 3 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/12955.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12955/head:pull/12955 PR: https://git.openjdk.org/jdk/pull/12955 From rcastanedalo at openjdk.org Mon Mar 27 10:50:35 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 10:50:35 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Mon, 20 Mar 2023 10:06:04 GMT, Tobias Holenstein wrote: >> Roberto Casta?eda Lozano has updated the pull request incrementally with five additional commits since the last revision: >> >> - Increase the bold text line factor slightly >> - Add extra horizontal margin for long labels and let them overflow within the node >> - Select slots as well >> - Remove code that is commented out >> - Assert inputLabel is non-null > > src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/CombineFilter.java line 71: > >> 69: } >> 70: } >> 71: > > I think `assert slot != null;` should be moved up here Makes sense, but I did not change it because the surrounding code is essentially dead (no current filter has a "reversed" `CombineRule`) and I would not be able to test it. Since this code has not been executed for years, it is likely to be broken anyway. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1149132866 From rcastanedalo at openjdk.org Mon Mar 27 11:00:37 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 11:00:37 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Mon, 20 Mar 2023 10:19:42 GMT, Tobias Holenstein wrote: >> Roberto Casta?eda Lozano has updated the pull request incrementally with five additional commits since the last revision: >> >> - Increase the bold text line factor slightly >> - Add extra horizontal margin for long labels and let them overflow within the node >> - Select slots as well >> - Remove code that is commented out >> - Assert inputLabel is non-null > > src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/Figure.java line 343: > >> 341: inputLabel = nodeTinyLabel; >> 342: } >> 343: if (inputLabel != null) { > > according to my IDE inputLabel is here always non-null. Replaced with an assertion. > src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/InputSlot.java line 76: > >> 74: int gapAmount = (int)((getPosition() + 1)*gapRatio); >> 75: return new Point(gapAmount + Figure.getSlotsWidth(Figure.getAllBefore(getFigure().getInputSlots(), this)) + getWidth()/2, -Figure.SLOT_START); >> 76: //return new Point((getFigure().getWidth() / (getFigure().getInputSlots().size() * 2)) * (getPosition() * 2 + 1), -Figure.SLOT_START); > > perhaps remove this old comment Thanks, done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1149143319 PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1149142813 From rcastanedalo at openjdk.org Mon Mar 27 11:09:34 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 11:09:34 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Mon, 20 Mar 2023 10:13:15 GMT, Tobias Holenstein wrote: >> Roberto Casta?eda Lozano has updated the pull request incrementally with five additional commits since the last revision: >> >> - Increase the bold text line factor slightly >> - Add extra horizontal margin for long labels and let them overflow within the node >> - Select slots as well >> - Remove code that is commented out >> - Assert inputLabel is non-null > > src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/FilterChain.java line 1: > >> 1: /* > > I think `applyInOrder` can be simplified as this : > > public void applyInOrder(Diagram d, FilterChain sequence) { > for (Filter f : sequence.getFilters()) { > if (filters.contains(f)) { > f.apply(d); > } > } > } > > > Reason: `FilterChain ordering` is the same as `this` in `FilterChain`. Usually `filters` are already in the order that we want them to apply. Only exception is when the user manually reoders the filters. `FilterChain sequence` contains all the filters in the order that they appear in the list. `filters` are the filters that are selected by the user and should alway be a subset of `sequence`. Therefore we can just iterate through `sequence` to get the correct order and apply each filter that is selected (contained in `filters`) Thanks for the suggestion! 
I tested your assumption ("`FilterChain ordering` is the same as `this` in `FilterChain`") but it does not hold in this PR. Note that the filter list is never ordered outside of `applyInOrder`. In any case, as I mentioned in https://github.com/openjdk/jdk/pull/12714#discussion_r1148879518, I propose to go with the fix in JDK-8302644 and discard the filter ordering changes from this PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1149152222 From eastigeevich at openjdk.org Mon Mar 27 11:09:44 2023 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 27 Mar 2023 11:09:44 GMT Subject: RFR: 8304387: Fix positions of shared static stubs / trampolines [v2] In-Reply-To: <1-rKX7bks0rCi-7k-IHi07-Abg53c8EAt9KCYjgv66E=.14b278ee-9e72-4ac3-abd2-d7ba4b71f397@github.com> References: <1-rKX7bks0rCi-7k-IHi07-Abg53c8EAt9KCYjgv66E=.14b278ee-9e72-4ac3-abd2-d7ba4b71f397@github.com> Message-ID: On Mon, 20 Mar 2023 12:08:16 GMT, Xiaolin Zheng wrote: >> This RFE fixes the positions of shared static stubs / trampolines. They should be like: >> >> >> [Verified Entry Point] >> ... >> ... >> ... >> [Stub Code] >> >> [Exception Handler] >> ... >> ... >> [Deopt Handler Code] >> ... >> ... >> >> >> Currently after we have the shared static stubs/trampolines in JDK-8280481 and JDK-8280152 : >> >> >> [Verified Entry Point] >> ... >> ... >> ... >> [Stub Code] >> >> [Exception Handler] >> ... >> ... >> [Deopt Handler Code] >> ... >> ... >> // they are presented in the Deopt range, though do not have correctness issues. >> >> >> For example on x86: >> >> >> [Verified Entry Point] >> ... >> [Stub Code] >> 0x00007fac68ef4908: nopl 0x0(%rax,%rax,1) ; {no_reloc} >> 0x00007fac68ef490d: mov $0x0,%rbx ; {static_stub} >> 0x00007fac68ef4917: jmpq 0x00007fac68ef4917 ; {runtime_call} >> 0x00007fac68ef491c: nop >> 0x00007fac68ef491d: mov $0x0,%rbx ; {static_stub} >> 0x00007fac68ef4927: jmpq 0x00007fac68ef4927 ; {runtime_call} >> [Exception Handler] >> 0x00007fac68ef492c: callq 0x00007fac703da280 ; {runtime_call handle_exception_from_callee Runtime1 stub} >> 0x00007fac68ef4931: mov $0x7fac885d8067,%rdi ; {external_word} >> 0x00007fac68ef493b: and $0xfffffffffffffff0,%rsp >> 0x00007fac68ef493f: callq 0x00007fac881e9900 ; {runtime_call MacroAssembler::debug64(char*, long, long*)} >> 0x00007fac68ef4944: hlt >> [Deopt Handler Code] >> 0x00007fac68ef4945: mov $0x7fac68ef4945,%r10 ; {section_word} >> 0x00007fac68ef494f: push %r10 >> 0x00007fac68ef4951: jmpq 0x00007fac70326520 ; {runtime_call DeoptimizationBlob} >> 0x00007fac68ef4956: mov $0x0,%rbx ; {static_stub} // <---------- here >> 0x00007fac68ef4960: jmpq 0x00007fac68ef4960 ; {runtime_call} >> 0x00007fac68ef4965: mov $0x0,%rbx ; {static_stub} // <---------- here >> 0x00007fac68ef496f: jmpq 0x00007fac68ef496f ; {runtime_call} >> 0x00007fac68ef4974: mov $0x0,%rbx ; {static_stub} // <---------- here >> 0x00007fac68ef497e: jmpq 0x00007fac68ef497e ; {runtime_call} >> 0x00007fac68ef4983: hlt >> 0x00007fac68ef4984: hlt >> 0x00007fac68ef4985: hlt >> 0x00007fac68ef4986: hlt >> 0x00007fac68ef4987: hlt >> -------------------------------------------------------------------------------- >> [/Disassembly] >> >> >> >> It can be simply reproduced and dumped by `-XX:+PrintAssembly`. >> >> Though the correctness doesn't get affected in the current case, we may need to move them to a better place, back into the `[Stub Code]`, which might be more reasonable and unified. 
Also for the performance's sake, `ciEnv::register_method()`, where `code_buffer->finalize_stubs()` locates currently, has two locks `Compile_lock` and `MethodCompileQueue_lock`. So I think it may be better to move `code_buffer->finalize_stubs()` out to C1 and C2 code generation phases, separately, before the exception handler code is emitted so they are inside the `[Stub Code]` range. >> >> BTW, this is the "direct cause" of [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) because shared trampolines and their data are generated at the end of compiled code, which is different from the original condition. Though for that issue, the root cause is still from the Binutils, for even if trampolines are generated at the end of code, we should not fail as well when disassembling. But that is another issue, please see [JDK-8302384](https://bugs.openjdk.org/browse/JDK-8302384) for more details. >> >> Tested x86, AArch64, and RISC-V hotspot tier1~4 with fastdebug build, no new errors found. >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Andrew's review comments: another cleanup lgtm ------------- PR Comment: https://git.openjdk.org/jdk/pull/13071#issuecomment-1484947294 From rcastanedalo at openjdk.org Mon Mar 27 11:28:34 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 11:28:34 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter In-Reply-To: <9JHT4fvwluc3EULCyjSchk9DUcoz3v6eQp1lZ7Ca0TI=.64b1968a-f894-49da-8532-a2f01e7ce911@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> <7-AdNFFLocni-2nnwHPVf8fy1ykMXn7hrP8dg1Z-0po=.1f02eac6-0213-495d-9193-645eb32f099d@github.com> <9JHT4fvwluc3EULCyjSchk9DUcoz3v6eQp1lZ7Ca0TI=.64b1968a-f894-49da-8532-a2f01e7ce911@github.com> Message-ID: On Mon, 20 Mar 2023 12:27:26 GMT, Christian Hagedorn wrote: > When selecting a CallStaticJava node, the custom node info is sometimes cut depending on the zoom level (sometimes more, sometimes less) Good catch, @chhagedorn! This is an existing issue in mainline IGV, you can reproduce it e.g. by showing a long property such as `dump_spec` in the node text. The issue just becomes more visible with the addition of custom node info in this changeset. As far as I understand, the node width is computed assuming it is selected (i.e. bold text) at 100% zoom level, and scaled proportionally to the selected zoom level. This assumes label fonts scale perfectly with the zoom level, which is not the case. As a result, very long node labels can overflow at different zoom levels than 100%. I don't see a better solution than multiplying the computed node width with a factor (`Figure::BOLD_LINE_FACTOR`) to account for the worst-case text overflow at any zoom level. This will not change the width of most nodes since this tends to be dominated by the input slots anyway, only for those nodes with long labels. I selected this factor experimentally to be of 6% of the total width. Hope this new version fixes the issue you observed. If not, please try out and suggest a more appropriate factor. 
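To make the idea concrete, here is a minimal sketch of the padding applied to the node width (the names and the exact constant are illustrative, not the actual `Figure` code):

```java
// Illustrative sketch only: pad the label width measured for the bold font at
// 100% zoom by a small constant factor, so that imperfect font scaling at other
// zoom levels cannot make a long label overflow its node. For most nodes the
// final width is still dominated by the input slots, so the padding rarely
// changes anything.
class NodeWidthSketch {
    static final double BOLD_LINE_FACTOR = 1.06; // assumed ~6% worst-case margin

    static int nodeWidth(int boldLabelWidth, int slotsWidth, int horizontalMargin) {
        int paddedLabelWidth = (int) Math.ceil(boldLabelWidth * BOLD_LINE_FACTOR);
        return Math.max(paddedLabelWidth, slotsWidth) + 2 * horizontalMargin;
    }
}
```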
------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1484974123 From rcastanedalo at openjdk.org Mon Mar 27 11:37:32 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 11:37:32 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter In-Reply-To: <9JHT4fvwluc3EULCyjSchk9DUcoz3v6eQp1lZ7Ca0TI=.64b1968a-f894-49da-8532-a2f01e7ce911@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> <7-AdNFFLocni-2nnwHPVf8fy1ykMXn7hrP8dg1Z-0po=.1f02eac6-0213-495d-9193-645eb32f099d@github.com> <9JHT4fvwluc3EULCyjSchk9DUcoz3v6eQp1lZ7Ca0TI=.64b1968a-f894-49da-8532-a2f01e7ce911@github.com> Message-ID: On Mon, 20 Mar 2023 12:27:26 GMT, Christian Hagedorn wrote: > Selecting an inlined node with the condensed graph filter does not work when searching for it. For example, I can search for 165 Bool node in the search field. It finds it but when clicking on it, it shows me an empty graph. I would have expected to see the following graph with the "outer" node being selected which includes 165 Bool. Selecting, highlighting, centering, synchronizing etc. inlined and combined nodes ("slots" in IGV speak) has not been possible at all before this changeset. You can reproduce similar issues when using the "Simplify graph" filter in mainline IGV. I included some basic (admittedly half-baked) support for this in this changeset (enhanced searching and parts of selecting, but not highlighting, centering, or synchronizing among tabs), but implementing full support would require a rather deep refactoring of IGV. I will not have time to work on such a refactoring in the coming weeks, so I propose to simply remove the partial support for slot interaction implemented provided by this changeset, so that we leave IGV in the same consistent state as before, and create a RFE for adding proper support in the future. @chhagedorn, @tobiasholenstein what do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1484984666 From rcastanedalo at openjdk.org Mon Mar 27 11:45:37 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 11:45:37 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter In-Reply-To: <9JHT4fvwluc3EULCyjSchk9DUcoz3v6eQp1lZ7Ca0TI=.64b1968a-f894-49da-8532-a2f01e7ce911@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> <7-AdNFFLocni-2nnwHPVf8fy1ykMXn7hrP8dg1Z-0po=.1f02eac6-0213-495d-9193-645eb32f099d@github.com> <9JHT4fvwluc3EULCyjSchk9DUcoz3v6eQp1lZ7Ca0TI=.64b1968a-f894-49da-8532-a2f01e7ce911@github.com> Message-ID: On Mon, 20 Mar 2023 12:27:26 GMT, Christian Hagedorn wrote: > Maybe the node info can be improved further in a future RFE, for example for CountedLoop nodes to also show if it is a pre/main/post loop or to add the stride. Good suggestion! I agree that there is room for further exploiting custom node info in the future, loop nodes are excellent candidates :) Thanks @chhagedorn and @tobiasholenstein for looking at this (rather large) changeset and giving valuable feedback! I addressed your comments, please let me know what you think. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1484991958 PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1484994462 From rcastanedalo at openjdk.org Mon Mar 27 11:45:38 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 11:45:38 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Mon, 20 Mar 2023 11:49:52 GMT, Tobias Holenstein wrote: >> src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramScene.java line 218: >> >>> 216: if (ids.contains(figure.getInputNode().getId())) { >>> 217: selectedFigures.add(figure); >>> 218: } >> >> Suggestion: >> >> } >> for (Slot slot : figure.getSlots()) { >> if (!Collections.disjoint(slot.getSource().getSourceNodesAsSet(), ids)) { >> highlightedObjects.add(slot); >> } >> } >> >> I am not sure what your intent was in adding the slots to the selected objects. If you wanted the slots to be selected globally in "link global node selection" mode, you need to add the following code to make it work > > Even if this was not you intention, I think selecting the slots globally is a useful feature. Thanks for the suggestion! I added support for this now, however after considering the amount of work required to implement proper interaction with slots in a consistent manner, I lean towards excluding the partial implementation from this changeset and leaving it instead for future work, see my reply to @chhagedorn. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1149183957 From epeter at openjdk.org Mon Mar 27 12:00:41 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 27 Mar 2023 12:00:41 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2] In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 15:21:45 GMT, Roberto Casta?eda Lozano wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. 
The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. 
This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. 
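(Side note for readers of the thread: the kind of manually unrolled reduction where the same-input-index assumption mentioned above does not hold looks roughly like this — an illustrative sketch, not the actual `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` source:)

```java
// Illustrative sketch only: a manually unrolled reduction in which the
// accumulator alternates between the two inputs of the add, so the chain of
// AddI nodes is not linked through one and the same input index.
class SwappedInputReduction {
    static int reduce(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length / 2; i++) {
            acc = acc + a[2 * i];      // accumulator enters through one input
            acc = a[2 * i + 1] + acc;  // ...and here through the other input
        }
        return acc;
    }
}
```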
> > Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 28 additional commits since the last revision: > > - Merge master > - Relax the reduction cycle search bound > - Remove redundant IR check precondition > - Use SuperWord members in reduction marking > - Remove redundant opcode checks > - Do not run test in x86-32 > - Update existing test instead of removing it > - Add negative vectorization test > - Update copyright headers > - Add two more reduction vectorization microbenchmarks > - ... and 18 more: https://git.openjdk.org/jdk/compare/8a9170f5...95f6cc33 Thanks for the updates @robcasloz . Looks good to me now. ------------- Marked as reviewed by epeter (Committer). PR Review: https://git.openjdk.org/jdk/pull/13120#pullrequestreview-1358957887 From xlinzheng at openjdk.org Mon Mar 27 12:01:29 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Mon, 27 Mar 2023 12:01:29 GMT Subject: RFR: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 Message-ID: Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. Apologies for the tier2 failure. I mainly focused on if there were new hs_errs when working on JDK-8304387. :-( I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). Thanks, Xiaolin ------------- Commit messages: - Test fixes after JDK-8304387 Changes: https://git.openjdk.org/jdk/pull/13135/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13135&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304681 Stats: 60 lines in 3 files changed: 3 ins; 49 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/13135.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13135/head:pull/13135 PR: https://git.openjdk.org/jdk/pull/13135 From eastigeevich at openjdk.org Mon Mar 27 12:01:31 2023 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 27 Mar 2023 12:01:31 GMT Subject: RFR: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 05:14:21 GMT, Xiaolin Zheng wrote: > Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. > > Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. > > Apologies for the tier2 failure. 
I mainly focused on if there were new hs_errs when working on JDK-8304387. :-( > > I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. > -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. > > Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). > > Thanks, > Xiaolin LGTM ------------- Marked as reviewed by eastigeevich (Committer). PR Review: https://git.openjdk.org/jdk/pull/13135#pullrequestreview-1358905815 From xlinzheng at openjdk.org Mon Mar 27 12:01:32 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Mon, 27 Mar 2023 12:01:32 GMT Subject: RFR: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 05:14:21 GMT, Xiaolin Zheng wrote: > Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. > > Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. > > Apologies for the tier2 failure. I mainly focused on if there were new hs_errs when working on JDK-8304387. :-( > > I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. > -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. > > Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). > > Thanks, > Xiaolin Hi @eastig, would you mind having a look at this simple change of test `SharedStubToInterpTest.java`? Thank you. Thank you for the review, Evgeny! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13135#issuecomment-1484450030 PR Comment: https://git.openjdk.org/jdk/pull/13135#issuecomment-1485007659 From rcastanedalo at openjdk.org Mon Mar 27 12:09:35 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 27 Mar 2023 12:09:35 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand In-Reply-To: <9K3JsxtFNTYymj86XUBOvnuQZd4fziGM7PkyPh_JefM=.17668ef2-cf1e-4145-9b02-308f9f123175@github.com> References: <9K3JsxtFNTYymj86XUBOvnuQZd4fziGM7PkyPh_JefM=.17668ef2-cf1e-4145-9b02-308f9f123175@github.com> Message-ID: <-a1gSqzgYFMtpORJMMuOn7ZY6U-iF6OpGy_jsSOZ4fA=.a124ff89-5d46-40d7-ad17-644f5e66f5d9@github.com> On Fri, 24 Mar 2023 15:17:21 GMT, Roberto Casta?eda Lozano wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. 
Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. 
as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. 
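(For illustration, the global-accumulator case has roughly the following JMH shape — hypothetical code under the assumption that JMH is on the classpath, not the actual `VectorReduction.java` source:)

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Hypothetical sketch: the accumulator is a field, so the reduction is wrapped
// by load/store nodes and only becomes recognizable after unrolling and
// subsequent optimizations.
@State(Scope.Thread)
public class AndReductionSketch {
    int[] ints = new int[1024];
    int globalAcc = -1;

    @Setup
    public void init() {
        for (int i = 0; i < ints.length; i++) {
            ints[i] = i;
        }
    }

    @Benchmark
    public int andRedOnGlobalAccumulator() {
        for (int i = 0; i < ints.length; i++) {
            globalAcc &= ints[i];
        }
        return globalAcc;
    }
}
```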
Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > I just pushed a new version addressing @eme64's comments and suggestions, please review. > Thanks for the updates @robcasloz . > Looks good to me now. Thanks for reviewing, Emanuel! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1485028650 From qamai at openjdk.org Mon Mar 27 13:35:46 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 27 Mar 2023 13:35:46 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v15] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. 
And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/e44625d6..f2086507 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=13-14 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From tholenstein at openjdk.org Mon Mar 27 15:59:47 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Mar 2023 15:59:47 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v8] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. 
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: don not fire filterChanged() a new graph is viewed ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/310cccb8..2d6409b9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=06-07 Stats: 155 lines in 7 files changed: 84 ins; 47 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From vladimir.kempik at gmail.com Mon Mar 27 16:17:59 2023 From: vladimir.kempik at gmail.com (Vladimir Kempik) Date: Mon, 27 Mar 2023 19:17:59 +0300 Subject: Missaligned memory accesses from JDK In-Reply-To: References: <29875E09-8B1E-4255-AAED-06305459C872@gmail.com> <330a2677.26fa2.186fdec7098.Coremail.yangfei@iscas.ac.cn> <6194F148-F760-407F-961E-180BBDC6AE4F@gmail.com> <66F089DC-6F97-488E-B337-80F3C0DAE5A1@gmail.com> Message-ID: <5FE55856-1B54-45F0-85D7-8C337A76B52C@gmail.com> Hello Andrew Your idea looks good, but a question arises: If I change emit_int16() to use Bytes::put_native_u2() then few platform might see perf penalty, I found these platforms to do aligned store in put_native_u2() unconditionally: ppc and arm32 doing same thing: static inline void put_native_u2(address p, u2 x) { if ((intptr_t(p) & 1) == 0) { *(u2*)p = x; } else { p[0] = x; p[1] = x >> 8; } } and x86 doing this (): static inline void put_native_u2(address p, u2 x) { put_native((void*)p, x); } template static inline void put_native(void* p, T x) { assert(p != NULL, "null pointer"); if (is_aligned(p, sizeof(T))) { *(T*)p = x; } else { memcpy(p, &x, sizeof(T)); } } Should I then make ppc/arm32/x86 to do aligned stores in put_native_u2 only if AvoidUnalignedAccesses is true ? Thanks in advance, Vladimir. > 27 ????? 2023 ?., ? 12:55, Andrew Haley ???????(?): > > On 3/20/23 15:26, Vladimir Kempik wrote: >> Could you please suggest on best way to make emit_intX methods not perform misaligned memory stores ? > > You should change emit_int16() to use Bytes::put_native_u2(). You should change > RiscV's put_native_u2() to do whatever the back end needs, respecting > AvoidUnalignedAccesses. > > -- > Andrew Haley (he/him) > Java Platform Lead Engineer > Red Hat UK Ltd. > https://keybase.io/andrewhaley > EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcking at openjdk.org Mon Mar 27 19:57:12 2023 From: jcking at openjdk.org (Justin King) Date: Mon, 27 Mar 2023 19:57:12 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v3] In-Reply-To: References: Message-ID: > Add missing `FREE_C_HEAP_ARRAY` call. 
Justin King has updated the pull request incrementally with four additional commits since the last revision: - Make directives enum macros more fool proof Signed-off-by: Justin King - Fix capitalization Signed-off-by: Justin King - Fix unqualified member variable reference Signed-off-by: Justin King - Update DirectiveSet to take ownership of string options Signed-off-by: Justin King ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13125/files - new: https://git.openjdk.org/jdk/pull/13125/files/a832d587..7ddd0cec Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13125&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13125&range=01-02 Stats: 95 lines in 3 files changed: 64 ins; 14 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/13125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13125/head:pull/13125 PR: https://git.openjdk.org/jdk/pull/13125 From jcking at openjdk.org Mon Mar 27 20:12:26 2023 From: jcking at openjdk.org (Justin King) Date: Mon, 27 Mar 2023 20:12:26 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v2] In-Reply-To: <9GW5pDTdap2DBR3yBsHs4TBw1S3SygQ7SUH276-VVzc=.289cf892-3c04-40d8-a898-b63d05a1528a@github.com> References: <5cez_uFvtwk06UhgF6VnHYGatqHOdnzyEB1mMNkAndQ=.e6489e78-ffcf-4c68-8fa3-cca52c372a28@github.com> <9GW5pDTdap2DBR3yBsHs4TBw1S3SygQ7SUH276-VVzc=.289cf892-3c04-40d8-a898-b63d05a1528a@github.com> Message-ID: On Fri, 24 Mar 2023 19:45:09 GMT, Justin King wrote: >> `DirectiveSet::compilecommand_compatibility_init` makes things even more complicated, because it doesnt update `_modified` and it should probably be making a copy of the string provided via `CompilerOracle::has_option_value`. Ugh... > > Nevermind, it looks like `CompilerOracle::has_option_value` has static duration effectively. So I should be able to fix this. Okay, think I was able to figure it out. Running GHA again. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1149741203 From dlong at openjdk.org Mon Mar 27 22:36:33 2023 From: dlong at openjdk.org (Dean Long) Date: Mon, 27 Mar 2023 22:36:33 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v3] In-Reply-To: References: Message-ID: <94qvA1Jc9ytfb1JyNEXv6P6L__Eeq_GpJgCdU4y4erw=.7a3a78be-3f22-425b-8ff2-477dac703577@github.com> On Mon, 27 Mar 2023 19:57:12 GMT, Justin King wrote: >> Add missing `FREE_C_HEAP_ARRAY` call. > > Justin King has updated the pull request incrementally with four additional commits since the last revision: > > - Make directives enum macros more fool proof > > Signed-off-by: Justin King > - Fix capitalization > > Signed-off-by: Justin King > - Fix unqualified member variable reference > > Signed-off-by: Justin King > - Update DirectiveSet to take ownership of string options > > Signed-off-by: Justin King src/hotspot/share/compiler/directivesParser.cpp line 321: > 319: strncpy(s, v->str.start, v->str.length + 1); > 320: s[v->str.length] = '\0'; > 321: (set->*test)((void *)&s); // Takes ownership. Would it be better to do the "set" only after validators have passed? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1149851012 From duke at openjdk.org Tue Mar 28 02:13:29 2023 From: duke at openjdk.org (changpeng1997) Date: Tue, 28 Mar 2023 02:13:29 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE Message-ID: We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations. The following instruction sequence movi v17.16b, #12 cmpgt p0.b, p7/z, z16.b, z17.b can be optimized to: cmpgt p0.b, p7/z, z16.b, #12 This patch does the following: 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend. SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to 127)or 5bit signed integer immediate (range from -16 to 15). 2. Add optimized match rules to generate the compare-with-immediate instructions. [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate- ------------- Commit messages: - 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE Changes: https://git.openjdk.org/jdk/pull/13200/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13200&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8301739 Stats: 895 lines in 8 files changed: 578 ins; 4 del; 313 mod Patch: https://git.openjdk.org/jdk/pull/13200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13200/head:pull/13200 PR: https://git.openjdk.org/jdk/pull/13200 From duke at openjdk.org Tue Mar 28 06:34:37 2023 From: duke at openjdk.org (SUN Guoyun) Date: Tue, 28 Mar 2023 06:34:37 GMT Subject: RFR: 8302814: Delete unused CountLoopEnd instruct with CmpX [v3] In-Reply-To: <5xGjZbHg_TihXg2tfD7ZHmUvN4KVf32AKlekqqSh36g=.b2a327de-98e9-4b4a-8dca-bb5fed1d22fc@github.com> References: <4tL_UFrMJLavPFPa_Q_O3himxRjM_sWRADfAXsMgZYk=.2ba5fdc5-0470-4bef-926b-2e40d0ca31e7@github.com> <32f3F61MuuypToLHcV6wFKZU8oTyQrJO3wHv9cxQghY=.3db91155-ea72-4315-952a-1aa10120af8c@github.com> <5xGjZbHg_TihXg2tfD7ZHmUvN4KVf32AKlekqqSh36g=.b2a327de-98e9-4b4a-8dca-bb5fed1d22fc@github.com> Message-ID: On Thu, 23 Feb 2023 18:39:39 GMT, Vladimir Kozlov wrote: >> SUN Guoyun has updated the pull request incrementally with one additional commit since the last revision: >> >> 8302814: Delete unused CountLoopEnd instruct with CmpX > > I am taking back my next comment because I see this test fails in other PRs: > >>May be that is why next test failed in your GHA testing: >>compiler/vectorization/runner/LoopRangeStrideTest.java @vnkozlov Do you have any comments on this patch and could you please sponsor it for me? ------------- PR Comment: https://git.openjdk.org/jdk/pull/12648#issuecomment-1486290093 From chagedorn at openjdk.org Tue Mar 28 08:45:35 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 28 Mar 2023 08:45:35 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Mon, 27 Mar 2023 08:52:32 GMT, Roberto Casta?eda Lozano wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. 
This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information. >> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). >> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset: >> - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), >> - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, >> - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and >> - defines and documents JavaScript helpers to simplify the new and existing available filters. >> >> Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. >> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). 
On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. >> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > Roberto Casta?eda Lozano has updated the pull request incrementally with five additional commits since the last revision: > > - Increase the bold text line factor slightly > - Add extra horizontal margin for long labels and let them overflow within the node > - Select slots as well > - Remove code that is commented out > - Assert inputLabel is non-null Thanks Roberto for your detailed answers! > > When selecting a CallStaticJava node, the custom node info is sometimes cut depending on the zoom level (sometimes more, sometimes less) > > Good catch, @chhagedorn! This is an existing issue in mainline IGV, you can reproduce it e.g. by showing a long property such as `dump_spec` in the node text. The issue just becomes more visible with the addition of custom node info in this changeset. As far as I understand, the node width is computed assuming it is selected (i.e. bold text) at 100% zoom level, and scaled proportionally to the selected zoom level. This assumes label fonts scale perfectly with the zoom level, which is not the case. As a result, very long node labels can overflow at different zoom levels than 100%. I don't see a better solution than multiplying the computed node width with a factor (`Figure::BOLD_LINE_FACTOR`) to account for the worst-case text overflow at any zoom level. This will not change the width of most nodes since this tends to be dominated by the input slots anyway, only for those nodes with long labels. I selected this factor experimentally to be of 6% of the total width. Hope this new vers ion fixes the issue you observed. If not, please try out and suggest a more appropriate factor. That looks much better now! 6% seems to be a good value to go with. Thanks for fixing this general issue. > > Selecting an inlined node with the condensed graph filter does not work when searching for it. For example, I can search for 165 Bool node in the search field. It finds it but when clicking on it, it shows me an empty graph. I would have expected to see the following graph with the "outer" node being selected which includes 165 Bool. > > Selecting, highlighting, centering, synchronizing etc. inlined and combined nodes ("slots" in IGV speak) has not been possible at all before this changeset. You can reproduce similar issues when using the "Simplify graph" filter in mainline IGV. I see, I've never used the "Simplify graph" filter before. That's why I've only noticed this now. > I included some basic (admittedly half-baked) support for this in this changeset (enhanced searching and parts of selecting, but not highlighting, centering, or synchronizing among tabs), but implementing full support would require a rather deep refactoring of IGV. 
I will not have time to work on such a refactoring in the coming weeks, so I propose to simply remove the partial support for slot interaction provided by this changeset, so that we leave IGV in the same consistent state as before, and create an RFE for adding proper support in the future. @chhagedorn, @tobiasholenstein what do you think? I agree with your suggestion to remove the partial implementation and try to fully support it later in a separate RFE. That might be the cleanest solution for now. And we could still take your current code as a starting point for that RFE. > > Maybe the node info can be improved further in a future RFE, for example for CountedLoop nodes to also show if it is a pre/main/post loop or to add the stride. > > Good suggestion! I agree that there is room for further exploiting custom node info in the future, loop nodes are excellent candidates :) Great! :-) Can you file an RFE for that? Thanks, Christian ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1486452815 From rcastanedalo at openjdk.org Tue Mar 28 09:43:35 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 28 Mar 2023 09:43:35 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: <9PJw3VBi9OdgT86PdHMrcJSzNOnyehsVWMbaSIJ6KZI=.14672d50-27b9-45a1-bdfb-263d20602724@github.com> On Tue, 28 Mar 2023 08:42:53 GMT, Christian Hagedorn wrote: > I agree with your suggestion to remove the partial implementation and try to fully support it later in a separate RFE. That might be the cleanest solution for now. And we could still take your current code as a starting point for that RFE. Thanks, yes, if everyone agrees on this I will extract the partial implementation into a separate patch. What do you think @tobiasholenstein? > Great! :-) Can you file an RFE for that? Will do after integrating this changeset. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1486535626 PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1486536392 From bkilambi at openjdk.org Tue Mar 28 09:46:19 2023 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 28 Mar 2023 09:46:19 GMT Subject: RFR: 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE [v2] In-Reply-To: References: Message-ID: > The cast operation for VectorMask from wider type to narrow type returns incorrect result for trueCount() method invocation for the resultant mask with SVE (on some SVE machines toLong() also results in incorrect values).
An example narrow operation which results in incorrect toLong() and trueCount() values is shown below for a 128-bit -> 64-bit conversion and this can be extended to other narrow operations where the source mask in bytes is either 4x or 8x the size of the result mask in bytes - > > > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static long narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).toLong(); > } > > public static void main(String[] args) { > long r = 0L; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("toLong() : " + r); > } > } > > > **C2 compilation result :** > java --add-modules jdk.incubator.vector TestMaskCast > toLong(): 15 > > **Interpreter result (for verification) :** > java --add-modules jdk.incubator.vector -Xint TestMaskCast > toLong(): 3 > > The incorrect results with toLong() have been observed only on the 128-bit and 256-bit SVE machines but they are not reproducible on a 512-bit machine. However, trueCount() returns incorrect values too and they are reproducible on all the SVE machines and thus is more reliable to use trueCount() to bring out the drawbacks of the current implementation of mask cast narrow operation for SVE. > > Replacing the call to toLong() by trueCount() in the above example - > > > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static int narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).trueCount(); > } > > public static void main(String[] args) { > int r = 0; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("trueCount() : " + r); > } > } > > > > **C2 compilation result:** > java --add-modules jdk.incubator.vector TestMaskCast > trueCount() : 4 > > **Interpreter result:** > java --add-modules jdk.incubator.vector -Xint TestMaskCast > trueCount() : 2 > > Since in this example, the source mask size in bytes is 2x that of the result mask, trueCount() returns 2x the number of true elements in the source mask. It would return 4x/8x the number of true elements in the source mask if the size of the source mask is 4x/8x that of result mask. > > The returned values are incorrect because of the higher order bits in the result not being cleared (since the result is narrowed down) and trueCount() or toLong() tend to consider the higher order bits in the vector register as well which results in incorrect value. For the 128-bit to 64-bit conversion with a mask - "TT" passed, the current implementation for mask cast narrow operation returns the same mask in the lower and upper half of the 128-bit register that is - "TTTT" which results in a long value of 15 (instead of 3 - "FFTT" for the 64-bit Integer mask) and number of true elements to be 4 (instead of 2). > > This patch proposes a fix for this problem. An already existing JTREG IR test - "test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java" has also been modified to call the trueCount() method as well since the toString() method alone cannot be used to reproduce the incorrect values in this bug. This test passes successfully on 128-bit, 256-bit and 512-bit SVE machines. 
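For readers who want to try the failure mode by hand, the following is a minimal, self-contained sketch of the kind of trueCount()/toLong() check described above. It is not the modified jtreg test itself, and the class name is invented here for illustration; it can be run with: java --add-modules jdk.incubator.vector MaskCastNarrowCheck.java

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorMask;

public class MaskCastNarrowCheck {
    // Lane values for the 128-bit long mask; only the first two entries are used.
    static final boolean[] MASK_ARR = {true, true, false, true};

    public static void main(String[] args) {
        for (int i = 0; i < 50_000; i++) { // warm up so C2 compiles the cast
            VectorMask<Long> wide = VectorMask.fromArray(LongVector.SPECIES_128, MASK_ARR, 0);
            VectorMask<Integer> narrow = wide.cast(IntVector.SPECIES_64);
            // Two true lanes were cast, so the narrowed mask must report
            // trueCount() == 2 and toLong() == 0b11 == 3.
            if (narrow.trueCount() != 2 || narrow.toLong() != 3L) {
                throw new AssertionError("stale upper bits after narrowing cast: trueCount="
                        + narrow.trueCount() + ", toLong=" + narrow.toLong());
            }
        }
        System.out.println("narrowing cast checks passed");
    }
}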
Since the IR test has been changed, it has been tested successfully on other platforms like x86 and aarch64 Neon machines as well to ensure the changes have not introduced any new errors. Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge master - 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE The cast operation for VectorMask from wider type to narrow type returns incorrect result for trueCount() method invocation for the resultant mask with SVE (on some SVE machines toLong() also results in incorrect values). An example narrow operation which results in incorrect toLong() and trueCount() values is shown below for a 128-bit -> 64-bit conversion and this can be extended to other narrow operations where the source mask in bytes is either 4x or 8x the size of the result mask in bytes - public class TestMaskCast { static final boolean [] mask_arr = {true, true, false, true}; public static long narrow_long() { VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); return lmask128.cast(IntVector.SPECIES_64).toLong(); } public static void main(String[] args) { long r = 0L; for (int ic = 0; ic < 50000; ic++) { r = narrow_long(); } System.out.println("toLong() : " + r); } } C2 compilation result : java --add-modules jdk.incubator.vector TestMaskCast toLong(): 15 Interpreter result (for verification) : java --add-modules jdk.incubator.vector -Xint TestMaskCast toLong(): 3 The incorrect results with toLong() have been observed only on the 128-bit and 256-bit SVE machines but they are not reproducible on a 512-bit machine. However, trueCount() returns incorrect values too and they are reproducible on all the SVE machines and thus is more reliable to use trueCount() to bring out the drawbacks of the current implementation of mask cast narrow operation for SVE. Replacing the call to toLong() by trueCount() in the above example - public class TestMaskCast { static final boolean [] mask_arr = {true, true, false, true}; public static int narrow_long() { VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); return lmask128.cast(IntVector.SPECIES_64).trueCount(); } public static void main(String[] args) { int r = 0; for (int ic = 0; ic < 50000; ic++) { r = narrow_long(); } System.out.println("trueCount() : " + r); } } C2 compilation result: java --add-modules jdk.incubator.vector TestMaskCast trueCount() : 4 Interpreter result: java --add-modules jdk.incubator.vector -Xint TestMaskCast trueCount() : 2 Since in this example, the source mask size in bytes is 2x that of the result mask, trueCount() returns 2x the number of true elements in the source mask. It would return 4x/8x the number of true elements in the source mask if the size of the source mask is 4x/8x that of result mask. The returned values are incorrect because of the higher order bits in the result not being cleared (since the result is narrowed down) and trueCount() or toLong() tend to consider the higher order bits in the vector register as well which results in incorrect value. 
For the 128-bit to 64-bit conversion with a mask - "TT" passed, the current implementation for mask cast narrow operation returns the same mask in the lower and upper half of the 128-bit register that is - "TTTT" which results in a long value of 15 (instead of 3 - "FFTT" for the 64-bit Integer mask) and number of true elements to be 4 (instead of 2). This patch proposes a fix for this problem. An already existing JTREG IR test - "test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java" has also been modified to call the trueCount() method as well since the toString() method alone cannot be used to reproduce the incorrect values in this bug. This test passes successfully on 128-bit, 256-bit and 512-bit SVE machines. Since the IR test has been changed, it has been tested successfully on other platforms like x86 and aarch64 Neon machines as well to ensure the changes have not introduced any new errors. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12901/files - new: https://git.openjdk.org/jdk/pull/12901/files/639281ed..ccb23e2d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12901&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12901&range=00-01 Stats: 160063 lines in 1880 files changed: 111867 ins; 29582 del; 18614 mod Patch: https://git.openjdk.org/jdk/pull/12901.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12901/head:pull/12901 PR: https://git.openjdk.org/jdk/pull/12901 From tholenstein at openjdk.org Tue Mar 28 09:53:36 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 28 Mar 2023 09:53:36 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: <9PJw3VBi9OdgT86PdHMrcJSzNOnyehsVWMbaSIJ6KZI=.14672d50-27b9-45a1-bdfb-263d20602724@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> <9PJw3VBi9OdgT86PdHMrcJSzNOnyehsVWMbaSIJ6KZI=.14672d50-27b9-45a1-bdfb-263d20602724@github.com> Message-ID: <4YovZ0nPrD-6HqovI-1_387NqckELwO8TkJQIgRZWaw=.b0b608c9-4ce6-4041-8d4c-630d125b9d2e@github.com> On Tue, 28 Mar 2023 09:40:22 GMT, Roberto Casta?eda Lozano wrote: > > I agree with your suggestion to remove the partial implementation and try to fully support it later in a separate RFE. That might be the cleanest solution for now. And we could still take your current code as a starting point for that RFE . > > Thanks, yes, if everyone agrees on this I will extract the partial implementation into a separate patch. What do you think @tobiasholenstein? Yes, that is fine with me. I think the only thing missing is searching for nodes that are hidden and inlined. IGV adds them to the set of visible nodes. But if the "outer" nodes is not visible, they are not shown. 
Only thing missing would be to make the outer nodes visible as well if any of the inlined nodes are visible ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1486547184 From tholenstein at openjdk.org Tue Mar 28 09:53:40 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 28 Mar 2023 09:53:40 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Mon, 27 Mar 2023 10:47:44 GMT, Roberto Casta?eda Lozano wrote: >> src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/CombineFilter.java line 71: >> >>> 69: } >>> 70: } >>> 71: >> >> I think `assert slot != null;` should be moved up here > > Makes sense, but I did not change it because the surrounding code is essentially dead (no current filter has a "reversed" `CombineRule`) and I would not be able to test it. Since this code has not been executed for years, it is likely to be broken anyway. okey >> src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/FilterChain.java line 1: >> >>> 1: /* >> >> I think `applyInOrder` can be simplified as this : >> >> public void applyInOrder(Diagram d, FilterChain sequence) { >> for (Filter f : sequence.getFilters()) { >> if (filters.contains(f)) { >> f.apply(d); >> } >> } >> } >> >> >> Reason: `FilterChain ordering` is the same as `this` in `FilterChain`. Usually `filters` are already in the order that we want them to apply. Only exception is when the user manually reoders the filters. `FilterChain sequence` contains all the filters in the order that they appear in the list. `filters` are the filters that are selected by the user and should alway be a subset of `sequence`. Therefore we can just iterate through `sequence` to get the correct order and apply each filter that is selected (contained in `filters`) > > Thanks for the suggestion! I tested your assumption ("`FilterChain ordering` is the same as `this` in `FilterChain`") but it does not hold in this PR. Note that the filter list is never ordered outside of `applyInOrder`. In any case, as I mentioned in https://github.com/openjdk/jdk/pull/12714#discussion_r1148879518, I propose to go with the fix in JDK-8302644 and discard the filter ordering changes from this PR. sounds good ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1150326349 PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1150327420 From tholenstein at openjdk.org Tue Mar 28 09:53:44 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 28 Mar 2023 09:53:44 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Mon, 27 Mar 2023 11:38:49 GMT, Roberto Casta?eda Lozano wrote: >> Even if this was not you intention, I think selecting the slots globally is a useful feature. > > Thanks for the suggestion! I added support for this now, however after considering the amount of work required to implement proper interaction with slots in a consistent manner, I lean towards excluding the partial implementation from this changeset and leaving it instead for future work, see my reply to @chhagedorn. 
ok, I am fine with leaving this for future work ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1150328437 From rcastanedalo at openjdk.org Tue Mar 28 10:00:36 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 28 Mar 2023 10:00:36 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: <4YovZ0nPrD-6HqovI-1_387NqckELwO8TkJQIgRZWaw=.b0b608c9-4ce6-4041-8d4c-630d125b9d2e@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> <9PJw3VBi9OdgT86PdHMrcJSzNOnyehsVWMbaSIJ6KZI=.14672d50-27b9-45a1-bdfb-263d20602724@github.com> <4YovZ0nPrD-6HqovI-1_387NqckELwO8TkJQIgRZWaw=.b0b608c9-4ce6-4041-8d4c-630d125b9d2e@github.com> Message-ID: On Tue, 28 Mar 2023 09:48:44 GMT, Tobias Holenstein wrote: > Yes, that is fine with me. I think the only thing missing is searching for nodes that are hidden and inlined. IGV adds them to the set of visible nodes. But if the "outer" nodes is not visible, they are not shown. Only thing missing would be to make the outer nodes visible as well if any of the inlined nodes are visible Thanks, I will add this information to the RFE. I think there are other features missing as well, e.g. centering slots after they are selected from the search box. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1486560819 From tholenstein at openjdk.org Tue Mar 28 10:37:15 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 28 Mar 2023 10:37:15 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v6] In-Reply-To: References: Message-ID: On Mon, 27 Mar 2023 07:14:09 GMT, Roberto Casta?eda Lozano wrote: >> Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: >> >> - always select previous profile for new tabs >> - .js ending for filters >> - save order of filters > > src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/FilterChain.java line 71: > >> 69: } >> 70: >> 71: public void applyInOrder(Diagram diagram, FilterChain filterOrder) { > > I have also addressed this in [#12955](https://github.com/openjdk/jdk/pull/12955), but I think your solution of sorting the filter list upfront, rather than every time filters are applied, is preferable. I will wait for this PR to be integrated and then exclude the corresponding changes from #12955. sounds good > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 326: > >> 324: >> 325: // called when the filter in filterChain changed, but not filterChain itself >> 326: private void filterChanged() { > > After applying this PR, `DiagramViewModel::filterChanged()` is fired every time a new graph in a group is viewed. This is not a functional bug, but it causes the expensive `DiagramViewModel::rebuildDiagram()` to be called twice in that scenario (the other call comes from `DiagramViewModel::changed()`). Would it be possible to arrange the code so that `DiagramViewModel::rebuildDiagram()` is only called once when a new graph in a group is viewed? You are right. `DiagramViewModel::rebuildDiagram()` was called too many times when it was not necessary. 
I updated the code - now `DiagramViewModel::rebuildDiagram()` should only be called when needed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1150379977 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1150379549 From rcastanedalo at openjdk.org Tue Mar 28 11:28:32 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 28 Mar 2023 11:28:32 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v8] In-Reply-To: References: Message-ID: On Mon, 27 Mar 2023 15:59:47 GMT, Tobias Holenstein wrote: >> In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). >> >> - With this change the user can define a set of filters for each individual graph tab using the `--Local--` profile >> - Further, a filter profile can be created that represents a set of filters. This filter profile can then be selected in each graph tab individually. >> >> ### Global profile >> Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. >> >> ### Local profile >> Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tab. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it clearer which tab is currently selected, the title of the selected tab is displayed in **bold** font. >> tabA >> >> When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. >> tabB >> >> ### New profile >> The user can also create a new filter profile and give it a name. E.g. `My Filters` >> newProfile >> >> The `My Filters` profile is then globally available to other tabs as well >> selectProfile >> >> >> ### Filters for cloned tabs >> When the user clones a tab, the `--Local--` profile gets cloned as well. Further, the clone has the same filter profile selected when it gets opened >> cloneTab >> >> ### Saving of filters and profiles >> When the user closes IGV, the filters (in their exact order) are saved, as well as the filter profiles. The profile that was last used is selected when opening IGV. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > don not fire filterChanged() a new graph is viewed Looks good! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/12714#pullrequestreview-1360823425 From jcking at openjdk.org Tue Mar 28 14:30:55 2023 From: jcking at openjdk.org (Justin King) Date: Tue, 28 Mar 2023 14:30:55 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v4] In-Reply-To: References: Message-ID: <9XO5we9RK8MKNE5HpGWLFySNOr6Y_TB6gXl13ksg0Yo=.dec7763e-9483-4c8c-ba79-7b6d47148d81@github.com> > Add missing `FREE_C_HEAP_ARRAY` call.
Justin King has updated the pull request incrementally with one additional commit since the last revision: Adjust logic based on review Signed-off-by: Justin King ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13125/files - new: https://git.openjdk.org/jdk/pull/13125/files/7ddd0cec..78bb3ff0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13125&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13125&range=02-03 Stats: 21 lines in 1 file changed: 12 ins; 4 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/13125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13125/head:pull/13125 PR: https://git.openjdk.org/jdk/pull/13125 From jcking at openjdk.org Tue Mar 28 14:31:11 2023 From: jcking at openjdk.org (Justin King) Date: Tue, 28 Mar 2023 14:31:11 GMT Subject: RFR: JDK-8304546: CompileTask::_directive leaked if CompileBroker::invoke_compiler_on_method not called In-Reply-To: References: Message-ID: On Mon, 20 Mar 2023 20:48:28 GMT, Justin King wrote: > Ensure `CompileTask::_directive` is not leaked when `CompileBroker::invoke_compiler_on_method` is not called. This can happen for stale tasks or when compilation is disabled. Poke. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13108#issuecomment-1487002028 From jcking at openjdk.org Tue Mar 28 14:31:00 2023 From: jcking at openjdk.org (Justin King) Date: Tue, 28 Mar 2023 14:31:00 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v3] In-Reply-To: <94qvA1Jc9ytfb1JyNEXv6P6L__Eeq_GpJgCdU4y4erw=.7a3a78be-3f22-425b-8ff2-477dac703577@github.com> References: <94qvA1Jc9ytfb1JyNEXv6P6L__Eeq_GpJgCdU4y4erw=.7a3a78be-3f22-425b-8ff2-477dac703577@github.com> Message-ID: On Mon, 27 Mar 2023 22:33:39 GMT, Dean Long wrote: >> Justin King has updated the pull request incrementally with four additional commits since the last revision: >> >> - Make directives enum macros more fool proof >> >> Signed-off-by: Justin King >> - Fix capitalization >> >> Signed-off-by: Justin King >> - Fix unqualified member variable reference >> >> Signed-off-by: Justin King >> - Update DirectiveSet to take ownership of string options >> >> Signed-off-by: Justin King > > src/hotspot/share/compiler/directivesParser.cpp line 321: > >> 319: strncpy(s, v->str.start, v->str.length + 1); >> 320: s[v->str.length] = '\0'; >> 321: (set->*test)((void *)&s); // Takes ownership. > > Would it be better to do the "set" only after validators have passed? Probably. Was just going with previous logic. I moved it to the end now. PTAL ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13125#discussion_r1150707374 From mdoerr at openjdk.org Tue Mar 28 15:48:38 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 28 Mar 2023 15:48:38 GMT Subject: Integrated: 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC In-Reply-To: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> References: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> Message-ID: <8Wt9q_QMDsZJiPuSbecwaF690aMdaSGYJ-eslGLutWc=.cbb4ffdd-b6f8-4350-b972-b74d39ac9192@github.com> On Fri, 24 Mar 2023 15:27:53 GMT, Martin Doerr wrote: > I suggest to remove this code for the following reasons: > - It doesn't work with ZGC (oop needs to go through load barrier, see JBS issue). > - It generates too much code. Loading oops is quite common and the oop verification code is quite lengthy. > - Other platforms don't have it, either. 
This pull request has now been integrated. Changeset: 695683b5 Author: Martin Doerr URL: https://git.openjdk.org/jdk/commit/695683b5b15c69a56fe7ee1a93482fe7c3530ca8 Stats: 8 lines in 1 file changed: 0 ins; 8 del; 0 mod 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC Reviewed-by: shade ------------- PR: https://git.openjdk.org/jdk/pull/13175 From mdoerr at openjdk.org Tue Mar 28 15:48:37 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 28 Mar 2023 15:48:37 GMT Subject: RFR: 8304880: [PPC64] VerifyOops code in C1 doesn't work with ZGC In-Reply-To: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> References: <8jN_obM4X1OzdLXezbnDx553YsdlAvapHLTU3vSPt8U=.598357c8-5c00-4284-b217-5240a1d89260@github.com> Message-ID: <01fWpRzUY-OlzBwDDkWjaK4SluvqXoB9l6YMJ7z3PSQ=.c1195a54-7a13-4b7a-8429-484a05f92b7f@github.com> On Fri, 24 Mar 2023 15:27:53 GMT, Martin Doerr wrote: > I suggest to remove this code for the following reasons: > - It doesn't work with ZGC (oop needs to go through load barrier, see JBS issue). > - It generates too much code. Loading oops is quite common and the oop verification code is quite lengthy. > - Other platforms don't have it, either. 17u is also affected. I keep this PR to fix only the ZGC issue without further cleanup. In addition, the store part may be interesting for testing generational ZGC. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13175#issuecomment-1487158178 From epeter at openjdk.org Tue Mar 28 16:28:02 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 28 Mar 2023 16:28:02 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 Message-ID: I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. **The Idea of VerifyLoopOptimizations** Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. **My Approach** I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. **What I fixed** - `verify_compare` - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. 
- `verify_tree` - I corrected the style and improved comments. - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. **Disabled Verifications** I commented out the following verifications: (A) data nodes should have same ctrl (B) ctrl node should belong to same loop (C) ctrl node should have same idom (D) loop should have same tail (E) loop should have same body (list of nodes) (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. **Follow-Up Work** I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. I propose the following order: - idom (C): The dominance structure is at the base of everything else. - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. - tail (D): ensure the tail of a loop is updated correctly - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. - other issues like (F) - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. **Testing** I am running `tier1-tier6` and stress testing. Preliminary results are all good. **Conclusion** With this fix, I have the basic infrastructure of the verification working. However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. Follow-up RFE's will have to address these one-by-one. ------------- Commit messages: - NULL to nullptr - comment the code differently, so that it looks like less changes - manual merge from master - 8173709: Fix VerifyLoopOptimizations - step 1 Changes: https://git.openjdk.org/jdk/pull/13207/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8173709 Stats: 182 lines in 2 files changed: 90 ins; 27 del; 65 mod Patch: https://git.openjdk.org/jdk/pull/13207.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13207/head:pull/13207 PR: https://git.openjdk.org/jdk/pull/13207 From rcastanedalo at openjdk.org Tue Mar 28 17:01:22 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 28 Mar 2023 17:01:22 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v3] In-Reply-To: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: > The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: > > - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and > - "Condense graph", which makes the graph more compact without loss of information. 
> > Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): > > ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) > > Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: > - combining Bool and conversion nodes into their predecessors, > - inlining all Parm nodes except control into their successors (this removes lots of long edges), > - removing "top" inputs from call-like nodes, > - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, > - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and > - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). > > The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: > > ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) > > Note that the exact input indices can still be retrieved via the incoming edge's tooltips: > > ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) > > The control-flow graph view is also adapted to this representation: > > ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) > > #### Additional improvements > > Additionally, this changeset: > - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), > - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, > - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and > - defines and documents JavaScript helpers to simplify the new and existing available filters. > > Here is an example of the effect of the new "Show custom node info" filter: > > ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) > > ### Testing > > #### Functionality > > - Tested the functionality manually on a small selection of graphs. > > - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > #### Performance > > Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. 
However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. > > The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: Revert "Select slots as well" This reverts commit 8256f0c20d7747cda691291c47841a9280d8c493. Revert "Fix figure selection" This reverts commit 71e73e89facfbb31614e4f1f3676c9e91a38e01a. Revert "Make slots searchable and selectable" This reverts commit 69cbec1f24ec5e941a5a72ab94b79117551d9560. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12955/files - new: https://git.openjdk.org/jdk/pull/12955/files/dde38762..1ea23e42 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12955&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12955&range=01-02 Stats: 48 lines in 2 files changed: 0 ins; 33 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/12955.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12955/head:pull/12955 PR: https://git.openjdk.org/jdk/pull/12955 From rcastanedalo at openjdk.org Tue Mar 28 17:12:31 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 28 Mar 2023 17:12:31 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v3] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Tue, 28 Mar 2023 17:01:22 GMT, Roberto Casta?eda Lozano wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information. >> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). >> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. 
This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset: >> - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), >> - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, >> - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and >> - defines and documents JavaScript helpers to simplify the new and existing available filters. >> >> Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. >> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. >> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Select slots as well" > > This reverts commit 8256f0c20d7747cda691291c47841a9280d8c493. > > Revert "Fix figure selection" > > This reverts commit 71e73e89facfbb31614e4f1f3676c9e91a38e01a. > > Revert "Make slots searchable and selectable" > > This reverts commit 69cbec1f24ec5e941a5a72ab94b79117551d9560. I just reverted the partial support for slot interaction, the behavior in this area should be now the same as in baseline IGV, please review. Will write an RFE with an initial implementation after this changeset is integrated. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1487302841 From kvn at openjdk.org Tue Mar 28 17:43:28 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 28 Mar 2023 17:43:28 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v15] In-Reply-To: References: Message-ID: On Mon, 27 Mar 2023 13:35:46 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > whitespace I have to retest it again. ------------- PR Review: https://git.openjdk.org/jdk/pull/9947#pullrequestreview-1361617932 From dlong at openjdk.org Tue Mar 28 21:19:34 2023 From: dlong at openjdk.org (Dean Long) Date: Tue, 28 Mar 2023 21:19:34 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v4] In-Reply-To: <9XO5we9RK8MKNE5HpGWLFySNOr6Y_TB6gXl13ksg0Yo=.dec7763e-9483-4c8c-ba79-7b6d47148d81@github.com> References: <9XO5we9RK8MKNE5HpGWLFySNOr6Y_TB6gXl13ksg0Yo=.dec7763e-9483-4c8c-ba79-7b6d47148d81@github.com> Message-ID: On Tue, 28 Mar 2023 14:30:55 GMT, Justin King wrote: >> Update `DirectivesSet` to take ownership of string options in some cases, to not leak memory. > > Justin King has updated the pull request incrementally with one additional commit since the last revision: > > Adjust logic based on review > > Signed-off-by: Justin King Looks good. 
------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13125#pullrequestreview-1361918952 From kvn at openjdk.org Tue Mar 28 23:18:33 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 28 Mar 2023 23:18:33 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 12:49:57 GMT, Emanuel Peter wrote: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. > > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. > `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. > > **My Approach** > I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. > > **What I fixed** > > - `verify_compare` > - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). > - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. > - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. > - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. > - `verify_tree` > - I corrected the style and improved comments. > - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. > > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. 
> - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one. Few comments src/hotspot/share/opto/loopnode.cpp line 4656: > 4654: // Verify ctrl and idom of every node. > 4655: int fail = 0; > 4656: verify_nodes(C->root(), &loop_verify, fail); I think result should be returned using local variables instead of storing it in passed variable. src/hotspot/share/opto/loopnode.cpp line 4691: > 4689: void PhaseIdealLoop::verify_node(Node* n, const PhaseIdealLoop* loop_verify, int &fail) const { > 4690: uint i = n->_idx; > 4691: // The loop-tree was built from def to use. The verification happens from def to use. I think it is not correct: "The verification happens from def to use." You put node's inputs (defs) on work list. src/hotspot/share/opto/loopnode.cpp line 4715: > 4713: assert(loop_verify->has_ctrl(n), "sanity"); > 4714: // n is a data node. > 4715: // Verify that it ctrl is the same. `its control` src/hotspot/share/opto/loopnode.cpp line 4805: > 4803: // within the loop tree can be reordered. We attempt to deal with that by > 4804: // reordering the verify's loop tree if possible. > 4805: void IdealLoopTree::verify_tree(IdealLoopTree* loop, const IdealLoopTree* parent) const { Should we rename `loop` to `loop_verify` similar to `verify_node()` code? src/hotspot/share/opto/loopnode.cpp line 4812: > 4810: tty->print_cr("reorder loop tree"); > 4811: // Find _next pointer to update (where "loop" is attached) > 4812: IdealLoopTree **pp = &loop->_parent->_child; May be use this opportunity and rename `pp` and `nn` to something meaningful. src/hotspot/share/opto/loopnode.cpp line 4814: > 4812: IdealLoopTree **pp = &loop->_parent->_child; > 4813: while (*pp != loop) { > 4814: pp = &((*pp)->_next); May be instead of this convoluted search, swap and verification code we can simply reorder whole siblings list first before verification? You don't need to swap - just create local ordered list. 
------------- PR Review: https://git.openjdk.org/jdk/pull/13207#pullrequestreview-1361945433 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151177078 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151179871 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151187594 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151217615 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151221030 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151237595 From kvn at openjdk.org Tue Mar 28 23:47:34 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 28 Mar 2023 23:47:34 GMT Subject: RFR: 8302814: Delete unused CountLoopEnd instruct with CmpX [v3] In-Reply-To: References: <4tL_UFrMJLavPFPa_Q_O3himxRjM_sWRADfAXsMgZYk=.2ba5fdc5-0470-4bef-926b-2e40d0ca31e7@github.com> <32f3F61MuuypToLHcV6wFKZU8oTyQrJO3wHv9cxQghY=.3db91155-ea72-4315-952a-1aa10120af8c@github.com> <5xGjZbHg_TihXg2tfD7ZHmUvN4KVf32AKlekqqSh36g=.b2a327de-98e9-4b4a-8dca-bb5fed1d22fc@github.com> Message-ID: On Tue, 28 Mar 2023 06:31:51 GMT, SUN Guoyun wrote: >> I am taking back my next comment because I see this test fails in other PRs: >> >>>May be that is why next test failed in your GHA testing: >>>compiler/vectorization/runner/LoopRangeStrideTest.java > > @vnkozlov Do you have any comments on this patch and could you please sponsor it for me? @sunny868 please merge master and I will retest it. And please ask for sponsorship sooner. I did not see you needed it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12648#issuecomment-1487747401 From kvn at openjdk.org Wed Mar 29 00:17:30 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Mar 2023 00:17:30 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies In-Reply-To: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: On Fri, 17 Mar 2023 14:34:26 GMT, Emanuel Peter wrote: > I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). > > Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). > > This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. Example: > > https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 > > This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): > > > 3.7 Scheduling > Dependence analysis before packing ensures that statements within a group can be executed > safely in parallel. However, it may be the case that executing two groups produces a dependence > violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between > groups if a statement in one group is dependent on a statement in the other. As long as there > are no cycles in this dependence graph, all groups can be scheduled such that no violations > occur. 
However, a cycle indicates that the set of chosen groups is invalid and at least one group > will need to be eliminated. Although experimental data has shown this case to be extremely rare, > care must be taken to ensure correctness. > > > **Solution** > > Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). > > **FYI** > > I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. src/hotspot/share/opto/superword.cpp line 2351: > 2349: #endif > 2350: > 2351: class PacksetGraph { May short description of what this class is doing. Something similar to your comment for `schedule()`. src/hotspot/share/opto/superword.cpp line 2355: > 2353: // pid: packset graph node id. > 2354: GrowableArray _pid; // Node.idx -> pid > 2355: GrowableArray _incnt; Please comment `_incnt` src/hotspot/share/opto/superword.cpp line 2384: > 2382: void set_pid(const Node* n, int pid) { > 2383: assert(n != nullptr && pid > 0, "sane inputs"); > 2384: _pid.at_put_grow(n->_idx, pid); This could be huge waste of space. `_idx` could be very big number vs mach smaller `_max_pid`. Can we use `_bb_idx` instead? src/hotspot/share/opto/superword.cpp line 2491: > 2489: return worklist.length() == _max_pid; > 2490: } > 2491: void print(bool print_nodes, bool print_zero_incnt) { I assume you need this parameters for debugging when you can change their values. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151272239 PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151264070 PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151266993 PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151272907 From kvn at openjdk.org Wed Mar 29 00:28:31 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Mar 2023 00:28:31 GMT Subject: RFR: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 In-Reply-To: References: Message-ID: <3Ia8uZmoE3Gul-yLcaWVJ4dLt3pCgEGU1n_7y-UDO2A=.e9b31af3-1521-48b1-a40d-95ace412d1ee@github.com> On Wed, 22 Mar 2023 05:14:21 GMT, Xiaolin Zheng wrote: > Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. > > Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. > > Apologies for the tier2 failure. I mainly focused on if there were new hs_errs when working on JDK-8304387. 
:-( > > I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. > -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. > > Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). > > Thanks, > Xiaolin Looks good to me too. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13135#pullrequestreview-1362091639 From duke at openjdk.org Wed Mar 29 01:19:43 2023 From: duke at openjdk.org (SUN Guoyun) Date: Wed, 29 Mar 2023 01:19:43 GMT Subject: RFR: 8302814: Delete unused CountLoopEnd instruct with CmpX [v6] In-Reply-To: <4tL_UFrMJLavPFPa_Q_O3himxRjM_sWRADfAXsMgZYk=.2ba5fdc5-0470-4bef-926b-2e40d0ca31e7@github.com> References: <4tL_UFrMJLavPFPa_Q_O3himxRjM_sWRADfAXsMgZYk=.2ba5fdc5-0470-4bef-926b-2e40d0ca31e7@github.com> Message-ID: > CountLoopEnd only for T_int, therefore the following instructs in riscv.ad are useless and should be deleted. > > CountedLoopEnd cmp (CmpU op1 op2) > CountedLoopEnd cmp (CmpP op1 op2) > CountedLoopEnd cmp (CmpN op1 op2) > CountedLoopEnd cmp (CmpF op1 op2) > CountedLoopEnd cmp (CmpD op1 op2) > > and CountedLoopEnd with CmpU on x86*.ad, aarch64.ad ar useless also. > > Please help review it. > > Thanks. SUN Guoyun has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: - Merge branch 'openjdk:master' into 8302814 - 8302814: Delete unused CountLoopEnd instruct with CmpX - 8302814: Delete unused CountLoopEnd instruct with CmpX - 8302814: Delete unused CountLoopEnd instruct with CmpX - Merge branch 'openjdk:master' into 8302814 - Merge branch 'openjdk:master' into 8302814 - 8302814: Delete unused CountLoopEnd instruct with CmpX ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12648/files - new: https://git.openjdk.org/jdk/pull/12648/files/fefb9122..0652b11c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12648&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12648&range=04-05 Stats: 360496 lines in 2711 files changed: 216340 ins; 121512 del; 22644 mod Patch: https://git.openjdk.org/jdk/pull/12648.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12648/head:pull/12648 PR: https://git.openjdk.org/jdk/pull/12648 From eliu at openjdk.org Wed Mar 29 01:22:10 2023 From: eliu at openjdk.org (Eric Liu) Date: Wed, 29 Mar 2023 01:22:10 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB [v2] In-Reply-To: References: Message-ID: > This is a trivial patch, which fixes the bottom type of ExtractB/UB nodes. > > ExtractNode can be generated by Vector API Vector.lane(int), which gets the lane element at the given index. A more precise type of range can help to optimize out unnecessary type conversion in some cases. 
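For concreteness, the conversion in question is the usual byte sign-extension shift pair; a minimal sketch, assuming a hypothetical helper that is not part of the patch:

// If x is already known to be in the byte range [-128, 127], this shift
// pair is an identity, which is what a precise ExtractB type lets C2
// prove and remove.
static int signExtendLowByte(int x) {
    return (x << 24) >> 24;
}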
> > Below shows a typical case used ExtractBNode > > > public static byte byteLt16() { > ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); > return vecb.lane(1); > } > > > In this case, c2 constructs IR graph like: > > ExtractB ConI(24) > | __| > | / | > LShiftI __| > | / > RShiftI > > which generates AArch64 code: > > movi v16.16b, #0x1 > smov x11, v16.b[1] > sxtb w0, w11 > > with this patch, this shift pair can be optimized out by RShiftI's identity [1]. The code is optimized to: > > movi v16.16b, #0x1 > smov x0, v16.b[1] > > [TEST] > > Full jtreg passed except 4 files on x86: > > jdk/incubator/vector/Byte128VectorTests.java > jdk/incubator/vector/Byte256VectorTests.java > jdk/incubator/vector/Byte512VectorTests.java > jdk/incubator/vector/Byte64VectorTests.java > > They are caused by a known issue on x86 [2]. > > [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 > [2] https://bugs.openjdk.org/browse/JDK-8303508 Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge jdk:master Change-Id: I40cce803da09bae31cd74b86bf93607a08219545 - 8303278: Imprecise bottom type of ExtractB/UB This is a trivial patch, which fixes the bottom type of ExtractB/UB nodes. ExtractNode can be generated by Vector API Vector.lane(int), which gets the lane element at the given index. A more precise type of range can help to optimize out unnecessary type conversion in some cases. Below shows a typical case used ExtractBNode ``` public static byte byteLt16() { ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); return vecb.lane(1); } ``` In this case, c2 constructs IR graph like: ExtractB ConI(24) | __| | / | LShiftI __| | / RShiftI which generates AArch64 code: movi v16.16b, #0x1 smov x11, v16.b[1] sxtb w0, w11 with this patch, this shift pair can be optimized out by RShiftI's identity [1]. The code is optimized to: movi v16.16b, #0x1 smov x0, v16.b[1] [TEST] Full jtreg passed except 4 files on x86: jdk/incubator/vector/Byte128VectorTests.java jdk/incubator/vector/Byte256VectorTests.java jdk/incubator/vector/Byte512VectorTests.java jdk/incubator/vector/Byte64VectorTests.java They are caused by a known issue on x86 [2]. 
[1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 [2] https://bugs.openjdk.org/browse/JDK-8303508 Change-Id: Ibea9aeacb41b4d1c5b2621c7a97494429394b599 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13070/files - new: https://git.openjdk.org/jdk/pull/13070/files/29e99153..12748c7a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13070&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13070&range=00-01 Stats: 90611 lines in 1154 files changed: 52776 ins; 26535 del; 11300 mod Patch: https://git.openjdk.org/jdk/pull/13070.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13070/head:pull/13070 PR: https://git.openjdk.org/jdk/pull/13070 From xlinzheng at openjdk.org Wed Mar 29 02:14:32 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 29 Mar 2023 02:14:32 GMT Subject: RFR: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 05:14:21 GMT, Xiaolin Zheng wrote: > Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. > > Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. > > Apologies for the tier2 failure. I mainly focused on if there were new hs_errs when working on JDK-8304387. :-( > > I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. > -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. > > Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). > > Thanks, > Xiaolin Thank you for the review, Vladimir! Only a test fix. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13135#issuecomment-1487848906 From fgao at openjdk.org Wed Mar 29 04:33:33 2023 From: fgao at openjdk.org (Fei Gao) Date: Wed, 29 Mar 2023 04:33:33 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies In-Reply-To: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: <6WzUZE5-847OaD_RKVmD2ieyZqOx2M_o0F0_k_yEpJI=.f8c493d1-1a36-4178-a446-d83225c07092@github.com> On Fri, 17 Mar 2023 14:34:26 GMT, Emanuel Peter wrote: > I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). > > Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). > > This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. 
Example: > > https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 > > This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): > > > 3.7 Scheduling > Dependence analysis before packing ensures that statements within a group can be executed > safely in parallel. However, it may be the case that executing two groups produces a dependence > violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between > groups if a statement in one group is dependent on a statement in the other. As long as there > are no cycles in this dependence graph, all groups can be scheduled such that no violations > occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group > will need to be eliminated. Although experimental data has shown this case to be extremely rare, > care must be taken to ensure correctness. > > > **Solution** > > Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). > > **FYI** > > I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. src/hotspot/share/opto/superword.cpp line 2429: > 2427: for (uint k = 0; k < p->size(); k++) { > 2428: Node* n = p->at(k); > 2429: int pid = get_pid(n); Nodes in the same pack have the same `pid`, right? Why do we need to fetch `pid` separately for each node in the `p` ? src/hotspot/share/opto/superword.cpp line 2449: > 2447: Node* n = _block.at(i); > 2448: int pid = get_pid_or_zero(n); > 2449: if (pid == 0 || pid <= max_pid_packset) { Is `pid == 0 ||` repetitive here? We have `max_pid_packset >= 0`, right? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151363087 PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151379804 From chagedorn at openjdk.org Wed Mar 29 07:10:37 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 29 Mar 2023 07:10:37 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v3] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: <0lnz6Gw7u04DbFYQdEvxmotlG8ODG5aEERjeHPaGeTM=.1d754812-f7e7-4822-8395-bea9e099fe9f@github.com> On Tue, 28 Mar 2023 17:01:22 GMT, Roberto Casta?eda Lozano wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information. 
>> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). >> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset: >> - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), >> - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, >> - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and >> - defines and documents JavaScript helpers to simplify the new and existing available filters. >> >> Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. >> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. 
Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. >> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Select slots as well" > > This reverts commit 8256f0c20d7747cda691291c47841a9280d8c493. > > Revert "Fix figure selection" > > This reverts commit 71e73e89facfbb31614e4f1f3676c9e91a38e01a. > > Revert "Make slots searchable and selectable" > > This reverts commit 69cbec1f24ec5e941a5a72ab94b79117551d9560. I've tested your update and it looks good! The code details are hard to review but I fully trust Toby's expertise there :-) src/utils/IdealGraphVisualizer/ServerCompiler/src/main/resources/com/sun/hotspot/igv/servercompiler/filters/customNodeInfo.filter line 1: > 1: // This filter add a new line to the label of selected nodes with custom Suggestion: // This filter adds a new line to the label of selected nodes with custom ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/12955#pullrequestreview-1362375078 PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1151475942 From xliu at openjdk.org Wed Mar 29 07:24:25 2023 From: xliu at openjdk.org (Xin Liu) Date: Wed, 29 Mar 2023 07:24:25 GMT Subject: RFR: 8305142: Can't bootstrap ctw.jar Message-ID: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> This patch add a few add-exports so CTW can access those internal packages of java.base module. make succeeds and ctw.jar is generated as expected. ------------- Commit messages: - 8305142: Can't bootstrap ctw.jar Changes: https://git.openjdk.org/jdk/pull/13220/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13220&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305142 Stats: 6 lines in 1 file changed: 5 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13220.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13220/head:pull/13220 PR: https://git.openjdk.org/jdk/pull/13220 From rcastanedalo at openjdk.org Wed Mar 29 07:38:13 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 29 Mar 2023 07:38:13 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v4] In-Reply-To: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: > The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: > > - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and > - "Condense graph", which makes the graph more compact without loss of information. 
> > Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): > > ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) > > Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: > - combining Bool and conversion nodes into their predecessors, > - inlining all Parm nodes except control into their successors (this removes lots of long edges), > - removing "top" inputs from call-like nodes, > - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, > - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and > - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). > > The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: > > ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) > > Note that the exact input indices can still be retrieved via the incoming edge's tooltips: > > ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) > > The control-flow graph view is also adapted to this representation: > > ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) > > #### Additional improvements > > Additionally, this changeset: > - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), > - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, > - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and > - defines and documents JavaScript helpers to simplify the new and existing available filters. > > Here is an example of the effect of the new "Show custom node info" filter: > > ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) > > ### Testing > > #### Functionality > > - Tested the functionality manually on a small selection of graphs. > > - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > #### Performance > > Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. 
However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. > > The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: Fix comment typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12955/files - new: https://git.openjdk.org/jdk/pull/12955/files/1ea23e42..10bc0646 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12955&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12955&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12955.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12955/head:pull/12955 PR: https://git.openjdk.org/jdk/pull/12955 From rcastanedalo at openjdk.org Wed Mar 29 07:38:15 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 29 Mar 2023 07:38:15 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v3] In-Reply-To: <0lnz6Gw7u04DbFYQdEvxmotlG8ODG5aEERjeHPaGeTM=.1d754812-f7e7-4822-8395-bea9e099fe9f@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> <0lnz6Gw7u04DbFYQdEvxmotlG8ODG5aEERjeHPaGeTM=.1d754812-f7e7-4822-8395-bea9e099fe9f@github.com> Message-ID: On Wed, 29 Mar 2023 07:07:18 GMT, Christian Hagedorn wrote: > I've tested your update and it looks good! Thanks again, Christian! > src/utils/IdealGraphVisualizer/ServerCompiler/src/main/resources/com/sun/hotspot/igv/servercompiler/filters/customNodeInfo.filter line 1: > >> 1: // This filter add a new line to the label of selected nodes with custom > > Suggestion: > > // This filter adds a new line to the label of selected nodes with custom Fixed, thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1488086037 PR Review Comment: https://git.openjdk.org/jdk/pull/12955#discussion_r1151508129 From epeter at openjdk.org Wed Mar 29 07:48:33 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 07:48:33 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: <_kT34J8rFaAwBHPfXTzlnRRI6SaFXR_acj9MJUj6qKg=.fa14ce76-9a6c-46f7-b459-59c097a7c8eb@github.com> On Tue, 28 Mar 2023 21:38:57 GMT, Vladimir Kozlov wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. >> >> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. >> >> **The Idea of VerifyLoopOptimizations** >> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. >> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. 
>> >> **My Approach** >> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. >> >> **What I fixed** >> >> - `verify_compare` >> - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). >> - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. >> - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. >> - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. >> - `verify_tree` >> - I corrected the style and improved comments. >> - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. >> >> **Disabled Verifications** >> I commented out the following verifications: >> >> (A) data nodes should have same ctrl >> (B) ctrl node should belong to same loop >> (C) ctrl node should have same idom >> (D) loop should have same tail >> (E) loop should have same body (list of nodes) >> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong >> >> >> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. >> >> **Follow-Up Work** >> >> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. >> >> I propose the following order: >> >> - idom (C): The dominance structure is at the base of everything else. >> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. >> - tail (D): ensure the tail of a loop is updated correctly >> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. >> - other issues like (F) >> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. >> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. >> >> **Testing** >> I am running `tier1-tier6` and stress testing. >> Preliminary results are all good. >> >> **Conclusion** >> With this fix, I have the basic infrastructure of the verification working. >> However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. >> Follow-up RFE's will have to address these one-by-one. > > src/hotspot/share/opto/loopnode.cpp line 4656: > >> 4654: // Verify ctrl and idom of every node. >> 4655: int fail = 0; >> 4656: verify_nodes(C->root(), &loop_verify, fail); > > I think result should be returned using local variables instead of storing it in passed variable. The nice thing about a variable is that it has an address in memory. Then I can set a watchpoint in the debugger, and go back to it easily, stepping through all the `fail` incrementations. Before this patch, we just used a static variable `fail`. I could also revert to that. 
It is just not thread-safe. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151522000 From rcastanedalo at openjdk.org Wed Mar 29 07:53:32 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 29 Mar 2023 07:53:32 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 12:49:57 GMT, Emanuel Peter wrote: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. > > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. > `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. > > **My Approach** > I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. > > **What I fixed** > > - `verify_compare` > - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). > - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. > - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. > - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. > - `verify_tree` > - I corrected the style and improved comments. > - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. > > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. 
> - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one. Thanks for working on this, Emanuel! Is there any chance that, in the scope of this work, we could enable the `VerifyLoopOptimizations` flag by default? This would mitigate the risk of this code degrading again in the future. In my opinion, it would be worth doing this even if it meant sacrificing some of the more expensive verification steps. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13207#issuecomment-1488105455 From shade at openjdk.org Wed Mar 29 08:00:37 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 29 Mar 2023 08:00:37 GMT Subject: RFR: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 05:14:21 GMT, Xiaolin Zheng wrote: > Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. > > Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. > > Apologies for the tier2 failure. I mainly focused on if there were new hs_errs when working on JDK-8304387. :-( > > I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. > -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. > > Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). > > Thanks, > Xiaolin Marked as reviewed by shade (Reviewer). 
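For illustration, the counting logic described above amounts to grouping relocation targets from the VM output and requiring that at least one target is referenced by two or more call sites; a minimal sketch in plain Java, assuming a made-up relocation line format rather than the real -XX:+PrintRelocations output:

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SharedStubCountSketch {
    // stdout: captured VM output; the pattern below is a placeholder only.
    static boolean hasSharedStub(String stdout) {
        List<String> targets = Pattern.compile("Static call stub to (\\S+)")
                .matcher(stdout)
                .results()
                .map(m -> m.group(1))
                .toList();
        Map<String, Long> counts = targets.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        // a stub is shared if some target occurs at least twice
        return counts.values().stream().anyMatch(c -> c >= 2);
    }
}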
test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 91: > 89: .matcher(output.getStdout()) > 90: .results() > 91: .map(m -> m.group(1)).toList(); Suggestion: .map(m -> m.group(1)) .toList(); test/hotspot/jtreg/compiler/sharedstubs/SharedTrampolineTest.java line 88: > 86: .matcher(output.getStdout()) > 87: .results() > 88: .map(m -> m.group(1)).toList(); Suggestion: .map(m -> m.group(1)) .toList(); ------------- PR Review: https://git.openjdk.org/jdk/pull/13135#pullrequestreview-1361521838 PR Review Comment: https://git.openjdk.org/jdk/pull/13135#discussion_r1150895562 PR Review Comment: https://git.openjdk.org/jdk/pull/13135#discussion_r1150895773 From epeter at openjdk.org Wed Mar 29 08:04:50 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 08:04:50 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 07:50:34 GMT, Roberto Casta?eda Lozano wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. >> >> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. >> >> **The Idea of VerifyLoopOptimizations** >> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. >> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. >> >> **My Approach** >> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. >> >> **What I fixed** >> >> - `verify_compare` >> - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). >> - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. >> - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. >> - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. >> - `verify_tree` >> - I corrected the style and improved comments. >> - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. >> >> **Disabled Verifications** >> I commented out the following verifications: >> >> (A) data nodes should have same ctrl >> (B) ctrl node should belong to same loop >> (C) ctrl node should have same idom >> (D) loop should have same tail >> (E) loop should have same body (list of nodes) >> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong >> >> >> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. 
But all of them are broken in many parts of the VM, as we have now not verified them for many years. >> >> **Follow-Up Work** >> >> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. >> >> I propose the following order: >> >> - idom (C): The dominance structure is at the base of everything else. >> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. >> - tail (D): ensure the tail of a loop is updated correctly >> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. >> - other issues like (F) >> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. >> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. >> >> **Testing** >> I am running `tier1-tier6` and stress testing. >> Preliminary results are all good. >> >> **Conclusion** >> With this fix, I have the basic infrastructure of the verification working. >> However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. >> Follow-up RFE's will have to address these one-by-one. > > Thanks for working on this, Emanuel! Is there any chance that, in the scope of this work, we could enable the `VerifyLoopOptimizations` flag by default? This would mitigate the risk of this code degrading again in the future. In my opinion, it would be worth doing this even if it meant sacrificing some of the more expensive verification steps. @robcasloz This is a trade-off with all verification and stress code. I think our general approach is that any time-consuming verification/stress testing only runs with a flag. This is important so that the performance does not drop too much. Because it would change the profiling and ergonomics, order of compilation and inlining, etc. Product failures may not easily reproduce in debug. Plus, it just makes all testing slower and more expensive. `VerifyLoopOptimizations` has quite the overhead. First, you need to build the loop-tree. That alone consists of multiple traversals over the whole graph. I suggest we keep it guarded by the flag, but add it to stress testing. And when one is touching anything to do with loop-opts, one should probably run testing with the flag. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13207#issuecomment-1488119585 From dzhang at openjdk.org Wed Mar 29 08:08:52 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 29 Mar 2023 08:08:52 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v10] In-Reply-To: References: Message-ID: <8968DzWTmYxKdJxGryQAe9cs6H5HBvVRYVYoPtVdrMc=.c124e38e-4766-4372-949a-d242bdd67925@github.com> > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... 
> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[3]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[2], so define v30 and v31 as mask register too. > > `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. > > By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [ ] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains one commit: RISC-V: Support vector add mask instructions for Vector API ------------- Changes: https://git.openjdk.org/jdk/pull/12682/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=09 Stats: 688 lines in 6 files changed: 622 ins; 5 del; 61 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Wed Mar 29 08:08:55 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 29 Mar 2023 08:08:55 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v3] In-Reply-To: References: Message-ID: <5NyOi_XulCNVzwgDfG9vxm4crdBkVObreycZrmd6iBU=.4b7a9086-e1d0-4416-8866-788580085026@github.com> On Wed, 22 Feb 2023 00:37:06 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! >> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. >> >> ## Load/Store/Cmp Mask >> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? >> >> 218 loadV V1, [R7] # vector (rvv) >> 220 vloadmask V0, V1 >> ... >> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 >> 24c vstoremask V1, V0 >> 258 storeV [R7], V1 # vector (rvv) >> >> >> The corresponding generated jit assembly? >> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[3]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. 
>> >> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[2], so define v30 and v31 as mask register too. >> >> `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [ ] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has refreshed the contents of this pull request, and previous commits have been removed. Incremental views are not available. Because operations such as AndVMask require more than one mask register, we are discussing a more rational approach to register allocation. 
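For illustration, the register pressure comes from code shapes where two masks are live at once before being combined; a minimal Vector API sketch, assuming hypothetical array arguments that are not part of the patch or its tests:

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class AndMaskSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    // Both m1 and m2 are loaded and stay live until the and(), so the backend
    // needs two mask registers here; the combined mask then predicates the add.
    static void workload(int[] a, int[] b, int[] r, boolean[] m1, boolean[] m2) {
        VectorMask<Integer> va = VectorMask.fromArray(SPECIES, m1, 0);
        VectorMask<Integer> vb = VectorMask.fromArray(SPECIES, m2, 0);
        VectorMask<Integer> both = va.and(vb);   // AndVMask at the IR level
        IntVector x = IntVector.fromArray(SPECIES, a, 0);
        IntVector y = IntVector.fromArray(SPECIES, b, 0);
        x.add(y, both).intoArray(r, 0);          // masked add, e.g. vadd.vv ... v0.t
    }
}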
------------- PR Comment: https://git.openjdk.org/jdk/pull/12682#issuecomment-1447518509 From xlinzheng at openjdk.org Wed Mar 29 08:11:55 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 29 Mar 2023 08:11:55 GMT Subject: RFR: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 [v2] In-Reply-To: References: Message-ID: > Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. > > Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. > > Apologies for the tier2 failure. I mainly focused on if there were new hs_errs when working on JDK-8304387. :-( > > I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. > -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. > > Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). > > Thanks, > Xiaolin Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: Aleksey's code style suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13135/files - new: https://git.openjdk.org/jdk/pull/13135/files/3f343c2b..d5573ac3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13135&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13135&range=00-01 Stats: 4 lines in 2 files changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13135.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13135/head:pull/13135 PR: https://git.openjdk.org/jdk/pull/13135 From rcastanedalo at openjdk.org Wed Mar 29 08:15:47 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 29 Mar 2023 08:15:47 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 08:01:35 GMT, Emanuel Peter wrote: > VerifyLoopOptimizations has quite the overhead. First, you need to build the loop-tree. That alone consists of multiple traversals over the whole graph. Thanks, would be interesting to quantify this overhead, just to make sure we are not being overcautious here. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13207#issuecomment-1488136114 From epeter at openjdk.org Wed Mar 29 08:15:53 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 08:15:53 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 23:11:09 GMT, Vladimir Kozlov wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. >> >> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. >> >> **The Idea of VerifyLoopOptimizations** >> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. 
At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. >> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. >> >> **My Approach** >> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. >> >> **What I fixed** >> >> - `verify_compare` >> - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). >> - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. >> - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. >> - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. >> - `verify_tree` >> - I corrected the style and improved comments. >> - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. >> >> **Disabled Verifications** >> I commented out the following verifications: >> >> (A) data nodes should have same ctrl >> (B) ctrl node should belong to same loop >> (C) ctrl node should have same idom >> (D) loop should have same tail >> (E) loop should have same body (list of nodes) >> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong >> >> >> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. >> >> **Follow-Up Work** >> >> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. >> >> I propose the following order: >> >> - idom (C): The dominance structure is at the base of everything else. >> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. >> - tail (D): ensure the tail of a loop is updated correctly >> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. >> - other issues like (F) >> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. >> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. >> >> **Testing** >> I am running `tier1-tier6` and stress testing. >> Preliminary results are all good. >> >> **Conclusion** >> With this fix, I have the basic infrastructure of the verification working. >> However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. >> Follow-up RFE's will have to address these one-by-one. 
> > src/hotspot/share/opto/loopnode.cpp line 4814: > >> 4812: IdealLoopTree **pp = &loop->_parent->_child; >> 4813: while (*pp != loop) { >> 4814: pp = &((*pp)->_next); > > May be instead of this convoluted search, swap and verification code we can simply reorder whole siblings list first before verification? You don't need to swap - just create local ordered list. Ok, I will refactor this code. I will have two local lists, sort them by `_head->_idx`. Then I can also verify that the lists are the same. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151554026 From shade at openjdk.org Wed Mar 29 08:21:06 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 29 Mar 2023 08:21:06 GMT Subject: RFR: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 [v2] In-Reply-To: References: Message-ID: <3KmwKDUCerB_U8TK_bh5aIW3H5RM_j68H_ktv8Uu1l8=.5ad320c6-cdc8-474e-a3b1-32b8044c84b0@github.com> On Wed, 29 Mar 2023 08:11:55 GMT, Xiaolin Zheng wrote: >> Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. >> >> Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. >> >> Apologies for the tier2 failure. I mainly focused on if there were new hs_errs when working on JDK-8304387. :-( >> >> I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. >> -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. >> >> Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Aleksey's code style suggestions Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13135#pullrequestreview-1362502181 From xlinzheng at openjdk.org Wed Mar 29 08:21:08 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 29 Mar 2023 08:21:08 GMT Subject: RFR: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 [v2] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 08:11:55 GMT, Xiaolin Zheng wrote: >> Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. >> >> Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. >> >> Apologies for the tier2 failure. I mainly focused on if there were new hs_errs when working on JDK-8304387. 
:-( >> >> I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. >> -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. >> >> Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Aleksey's code style suggestions Thank you for the review, Aleksey! Done; tests still pass on three platforms with the latest change (release and fastdebug build). Adding the label back. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13135#issuecomment-1488140110 From epeter at openjdk.org Wed Mar 29 08:24:15 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 08:24:15 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 12:49:57 GMT, Emanuel Peter wrote: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. > > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. > `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. > > **My Approach** > I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. > > **What I fixed** > > - `verify_compare` > - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). > - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. > - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. > - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. > - `verify_tree` > - I corrected the style and improved comments. > - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. 
> > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. > - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one. I had testing run, for `tier1-tier6` and stress testing. The non-verification run finished with `27d 3h` machine time. The verification run is still running, with at least `27d 7h` machine time. That overhead does not seem very significant. However, probably many verifications would only have a small overhead by themselves. But I'd suspect that all of them would have a more significant cumulative overhead. The reason why this code was rotting is that really nobody ever ran any tests with the flag. One additional approach to prevent the verification code from rotting: activate it for fuzzing, at least sometimes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13207#issuecomment-1488147139 From xlinzheng at openjdk.org Wed Mar 29 08:32:24 2023 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 29 Mar 2023 08:32:24 GMT Subject: Integrated: 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 In-Reply-To: References: Message-ID: On Wed, 22 Mar 2023 05:14:21 GMT, Xiaolin Zheng wrote: > Please review this test fix for [JDK-8304387](https://bugs.openjdk.org/browse/JDK-8304387) after RFR. > > Instead of specifying shared stubs' locations in this test, we could check if two or more relocations combined with each of them in this test, same as the other test `SharedTrampolineTest.java`. The counting logic is aligned with `SharedTrampolineTest.java`. Printing relocation stuff requires debug version vm, so this test is changed to debug only. Also some minor cleanups for the tests. > > Apologies for the tier2 failure. I mainly focused on if there were new hs_errs when working on JDK-8304387. 
:-( > > I am now confirming if the comment "Static stubs must be created at the end of the Stub section" could be removed, which needs a little extra time - though I think we can relax such limitations in `SharedStubToInterpTest.java`. > -- Update on March 27th: comfirmed; please see the JBS issue for the discussion. > > Tested x86_64, AArch64 and RISC-V for tests under `compiler/sharedstubs` folder, and now all passed (release, fastdebug). > > Thanks, > Xiaolin This pull request has now been integrated. Changeset: 09852884 Author: Xiaolin Zheng Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/09852884cc4f55b2c95e2dbe28cf5c7ad9095684 Stats: 59 lines in 3 files changed: 3 ins; 47 del; 9 mod 8304681: compiler/sharedstubs/SharedStubToInterpTest.java fails after JDK-8304387 Reviewed-by: eastigeevich, kvn, shade ------------- PR: https://git.openjdk.org/jdk/pull/13135 From shade at openjdk.org Wed Mar 29 08:34:10 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 29 Mar 2023 08:34:10 GMT Subject: RFR: 8305142: Can't bootstrap ctw.jar In-Reply-To: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> References: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> Message-ID: On Wed, 29 Mar 2023 07:17:23 GMT, Xin Liu wrote: > This patch add a few add-exports so CTW can access those internal packages of java.base module. > > make succeeds and ctw.jar is generated as expected. So this build failure only happens because we implicitly compile the rest of the testlib, even though CTW does not use it? Because if CTW uses these, I would have expected we need the similar additions in `CtwRunner` here: https://github.com/openjdk/jdk/blob/09852884cc4f55b2c95e2dbe28cf5c7ad9095684/test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/CtwRunner.java#L274-L278 ------------- PR Review: https://git.openjdk.org/jdk/pull/13220#pullrequestreview-1362527600 From rcastanedalo at openjdk.org Wed Mar 29 08:36:11 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 29 Mar 2023 08:36:11 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 08:21:20 GMT, Emanuel Peter wrote: > I had testing run, for tier1-tier6 and stress testing. > The non-verification run finished with 27d 3h machine time. > The verification run is still running, with at least 27d 7h machine time. Thanks for the data! If the verification run does not take much longer (say <1% on top of what the non-verification run takes), it might be a good trade-off to have it enabled by default. Not just to prevent the verification code from rotting but to actually get more value from it (better chances to find bugs earlier). ------------- PR Comment: https://git.openjdk.org/jdk/pull/13207#issuecomment-1488163510 From epeter at openjdk.org Wed Mar 29 08:47:45 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 08:47:45 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies In-Reply-To: References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: On Wed, 29 Mar 2023 00:00:53 GMT, Vladimir Kozlov wrote: >> I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). 
>> >> Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). >> >> This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. Example: >> >> https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 >> >> This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): >> >> >> 3.7 Scheduling >> Dependence analysis before packing ensures that statements within a group can be executed >> safely in parallel. However, it may be the case that executing two groups produces a dependence >> violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between >> groups if a statement in one group is dependent on a statement in the other. As long as there >> are no cycles in this dependence graph, all groups can be scheduled such that no violations >> occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group >> will need to be eliminated. Although experimental data has shown this case to be extremely rare, >> care must be taken to ensure correctness. >> >> >> **Solution** >> >> Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). >> >> **FYI** >> >> I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. > > src/hotspot/share/opto/superword.cpp line 2384: > >> 2382: void set_pid(const Node* n, int pid) { >> 2383: assert(n != nullptr && pid > 0, "sane inputs"); >> 2384: _pid.at_put_grow(n->_idx, pid); > > This could be huge waste of space. `_idx` could be very big number vs mach smaller `_max_pid`. > Can we use `_bb_idx` instead? Ah, that's a great idea :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151594369 From tholenstein at openjdk.org Wed Mar 29 08:53:19 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 08:53:19 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v4] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Wed, 29 Mar 2023 07:38:13 GMT, Roberto Casta?eda Lozano wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information. 
>> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). >> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset: >> - ensures that the selected filter subset is applied in the order listed in the "Filter" window (this is necessary for combining effectively the "Simplify graph" and "Condense graph" filters, but is also generally desirable for simplicity and consistency), >> - introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information, >> - extends the search functionality so that combined and inlined nodes can also be searched on and selected, and >> - defines and documents JavaScript helpers to simplify the new and existing available filters. >> >> Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. >> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. 
Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. >> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Fix comment typo looks good to me now! thanks for the work @robcasloz ------------- Marked as reviewed by tholenstein (Committer). PR Review: https://git.openjdk.org/jdk/pull/12955#pullrequestreview-1362561621 From tholenstein at openjdk.org Wed Mar 29 08:53:49 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 08:53:49 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v9] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. 
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: fix: update checkboxes when switching filter profiles based on feedback from @chhagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/2d6409b9..119b012e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=07-08 Stats: 30 lines in 3 files changed: 9 ins; 16 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From duke at openjdk.org Wed Mar 29 09:20:57 2023 From: duke at openjdk.org (changpeng1997) Date: Wed, 29 Mar 2023 09:20:57 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule Message-ID: We can use BCAX[1][2] to merge a bit clear and an exclusive-OR operation. For example, on a 128-bit aarch64 machine which supports NEON and SHA3, following instruction sequence: ... bic v16.16b, v16.16b, v17.16b eor v16.16b, v16.16b, v18.16b ... can be optimized to: ... bcax v16.16b, v17.16b, v16.16b, v18.16b ... This patch adds backend rules for BCAX, and we can gain almost 10% performance lift on a 128-bit aarch64 machine which supports NEON and SHA3. Similar performance uplift can also be observed on SVE2. Performance_Before: Benchmark Score(op/ms) Error TestByte#size(2048) 9779.361 47.184 TestInt#size(2048) 3028.617 7.292 TestLong#size(2048) 1331.216 1.815 TestShort#size(2048) 5828.089 8.975 Performance_BCAX_NEON: Benchmark Score(op/ms) Error TestByte#size(2048) 10510.371 34.931 TestInt#size(2048) 3437.512 81.318 TestLong#size(2048) 1461.023 0.679 TestShort#size(2048) 6238.210 26.452 [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/BCAX--Bit-Clear-and-XOR- [2]: https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/BCAX--Bitwise-clear-and-exclusive-OR-?lang=en ------------- Commit messages: - 8303553: AArch64: Add BCAX backend rule Changes: https://git.openjdk.org/jdk/pull/13222/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13222&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303553 Stats: 468 lines in 9 files changed: 433 ins; 0 del; 35 mod Patch: https://git.openjdk.org/jdk/pull/13222.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13222/head:pull/13222 PR: https://git.openjdk.org/jdk/pull/13222 From chagedorn at openjdk.org Wed Mar 29 09:27:19 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 29 Mar 2023 09:27:19 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v9] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 08:53:49 GMT, Tobias Holenstein wrote: >> In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). >> >> - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile >> - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. >> >> ### Global profile >> Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. >> >> ### Local profile >> Each tab has its own `--Local--` filter profile. 
Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. >> tabA >> >> When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. >> tabB >> >> ### New profile >> The user can also create a new filter profile and give it a name. E.g. `My Filters` >> newProfile >> >> The `My Filters` profile is then globally available to other tabs as well >> selectProfile >> >> >> ### Filters for cloned tabs >> When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened >> cloneTab >> >> ### Saving of filters and profiles >> When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > fix: update checkboxes when switching filter profiles > > based on feedback from @chhagedorn That's a nice improvement! I've tried your patch out again and it works as expected. Thanks for fixing the issue discussed offline. I only have some minor code style comments. Otherwise, looks good to me! src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/FilterChain.java line 133: > 131: public List getFilters() { > 132: return Collections.unmodifiableList(filters); > 133: } Missing new line Suggestion: } src/utils/IdealGraphVisualizer/FilterWindow/src/main/java/com/sun/hotspot/igv/filterwindow/FilterTopComponent.java line 352: > 350: } > 351: > 352: public void addFilter(CustomFilter cf) { While at it, you could rename `cf` and `fo` to something more descriptive like `customFilter` and `fileObject`. src/utils/IdealGraphVisualizer/Util/src/main/java/com/sun/hotspot/igv/util/RangeSliderModel.java line 49: > 47: > 48: public RangeSliderModel(RangeSliderModel model) { > 49: this(); I suggest to directly initialize the missing fields (`changedEvent` and `colorChangedEvent`) instead of initializing the other fields twice with `this()`. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 163: > 161: showNodeHull = true; > 162: showEmptyBlocks = true; > 163: group = graph.getGroup(); You could keep that to make `group` `final` again and then just initialize it in `init()` (or `initGroup()`) by directly accessing the field. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java line 379: > 377: } > 378: > 379: private boolean useBoldDisplayName = false; Should be moved up to the other fields src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java line 386: > 384: } > 385: > 386: Suggestion: ------------- Marked as reviewed by chagedorn (Reviewer). 
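To illustrate the copy-constructor suggestion above, here is a rough, self-contained Java sketch (illustrative names only, not the actual IGV `RangeSliderModel` code): the copy constructor initializes every field itself instead of delegating to `this()`, so the copied fields are not written twice.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only (not the real IGV class): the copy constructor initializes
// every field itself instead of delegating to this(), which would first assign defaults
// and then overwrite the copied fields a second time.
class SliderModelSketch {
    private final List<String> positions;   // stand-ins for the model's real fields
    private final List<Integer> colors;
    private final Object changedEvent;      // fields the no-arg path used to set up
    private final Object colorChangedEvent;

    SliderModelSketch() {
        this.positions = new ArrayList<>();
        this.colors = new ArrayList<>();
        this.changedEvent = new Object();
        this.colorChangedEvent = new Object();
    }

    SliderModelSketch(SliderModelSketch model) {
        // Directly initialize everything once, rather than calling this() first.
        this.positions = new ArrayList<>(model.positions);
        this.colors = new ArrayList<>(model.colors);
        this.changedEvent = new Object();
        this.colorChangedEvent = new Object();
    }

    public static void main(String[] args) {
        SliderModelSketch original = new SliderModelSketch();
        SliderModelSketch copy = new SliderModelSketch(original);
        System.out.println(copy != original);
    }
}
```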
PR Review: https://git.openjdk.org/jdk/pull/12714#pullrequestreview-1362595235 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151623131 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151627866 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151631346 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151637654 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151640971 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151641458 From epeter at openjdk.org Wed Mar 29 09:31:25 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 09:31:25 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies In-Reply-To: <6WzUZE5-847OaD_RKVmD2ieyZqOx2M_o0F0_k_yEpJI=.f8c493d1-1a36-4178-a446-d83225c07092@github.com> References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> <6WzUZE5-847OaD_RKVmD2ieyZqOx2M_o0F0_k_yEpJI=.f8c493d1-1a36-4178-a446-d83225c07092@github.com> Message-ID: On Wed, 29 Mar 2023 03:48:01 GMT, Fei Gao wrote: >> I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). >> >> Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). >> >> This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. Example: >> >> https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 >> >> This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): >> >> >> 3.7 Scheduling >> Dependence analysis before packing ensures that statements within a group can be executed >> safely in parallel. However, it may be the case that executing two groups produces a dependence >> violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between >> groups if a statement in one group is dependent on a statement in the other. As long as there >> are no cycles in this dependence graph, all groups can be scheduled such that no violations >> occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group >> will need to be eliminated. Although experimental data has shown this case to be extremely rare, >> care must be taken to ensure correctness. >> >> >> **Solution** >> >> Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). >> >> **FYI** >> >> I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. 
The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. > > src/hotspot/share/opto/superword.cpp line 2429: > >> 2427: for (uint k = 0; k < p->size(); k++) { >> 2428: Node* n = p->at(k); >> 2429: int pid = get_pid(n); > > Nodes in the same pack have the same `pid`, right? Why do we need to fetch `pid` separately for each node in the `p` ? pulled it out of the loop, but added an assert for it in the loop. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151648496 From epeter at openjdk.org Wed Mar 29 09:31:28 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 09:31:28 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies In-Reply-To: References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: On Wed, 29 Mar 2023 00:14:50 GMT, Vladimir Kozlov wrote: >> I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). >> >> Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). >> >> This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. Example: >> >> https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 >> >> This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): >> >> >> 3.7 Scheduling >> Dependence analysis before packing ensures that statements within a group can be executed >> safely in parallel. However, it may be the case that executing two groups produces a dependence >> violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between >> groups if a statement in one group is dependent on a statement in the other. As long as there >> are no cycles in this dependence graph, all groups can be scheduled such that no violations >> occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group >> will need to be eliminated. Although experimental data has shown this case to be extremely rare, >> care must be taken to ensure correctness. >> >> >> **Solution** >> >> Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). >> >> **FYI** >> >> I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. 
> > src/hotspot/share/opto/superword.cpp line 2491: > >> 2489: return worklist.length() == _max_pid; >> 2490: } >> 2491: void print(bool print_nodes, bool print_zero_incnt) { > > I assume you need this parameters for debugging when you can change their values. Exactly. This helps debugging the scheduling. I added a comment to explain the arguments. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151646027 From tholenstein at openjdk.org Wed Mar 29 09:34:20 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 09:34:20 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v10] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. 
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: Update src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/FilterChain.java missing newline Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/119b012e..8e821631 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=08-09 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From aph at openjdk.org Wed Mar 29 09:35:31 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 29 Mar 2023 09:35:31 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:12:59 GMT, changpeng1997 wrote: > We can use BCAX [1] [2] to merge a bit clear and an exclusive-OR operation. For example, on a 128-bit aarch64 machine which supports NEON and SHA3, following instruction sequence: > > > ... > bic v16.16b, v16.16b, v17.16b > eor v16.16b, v16.16b, v18.16b > ... > > > can be optimized to: > > > ... > bcax v16.16b, v17.16b, v16.16b, v18.16b > ... > > > This patch adds backend rules for BCAX, and we can gain almost 10% performance lift on a 128-bit aarch64 machine which supports NEON and SHA3. Similar performance uplift can also be observed on SVE2. > > Performance_Before: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 9779.361 47.184 > TestInt#size(2048) 3028.617 7.292 > TestLong#size(2048) 1331.216 1.815 > TestShort#size(2048) 5828.089 8.975 > > > Performance_BCAX_NEON: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 10510.371 34.931 > TestInt#size(2048) 3437.512 81.318 > TestLong#size(2048) 1461.023 0.679 > TestShort#size(2048) 6238.210 26.452 > > > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/BCAX--Bit-Clear-and-XOR- > [2]: https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/BCAX--Bitwise-clear-and-exclusive-OR-?lang=en Do we have any reason to believe that this instruction will ever be matched in a Java application? If not, it's just slowing down compilation for no good reason. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1488261480 From epeter at openjdk.org Wed Mar 29 09:38:12 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 09:38:12 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies In-Reply-To: <6WzUZE5-847OaD_RKVmD2ieyZqOx2M_o0F0_k_yEpJI=.f8c493d1-1a36-4178-a446-d83225c07092@github.com> References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> <6WzUZE5-847OaD_RKVmD2ieyZqOx2M_o0F0_k_yEpJI=.f8c493d1-1a36-4178-a446-d83225c07092@github.com> Message-ID: On Wed, 29 Mar 2023 04:26:48 GMT, Fei Gao wrote: >> I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). >> >> Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). >> >> This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. 
Example: >> >> https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 >> >> This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): >> >> >> 3.7 Scheduling >> Dependence analysis before packing ensures that statements within a group can be executed >> safely in parallel. However, it may be the case that executing two groups produces a dependence >> violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between >> groups if a statement in one group is dependent on a statement in the other. As long as there >> are no cycles in this dependence graph, all groups can be scheduled such that no violations >> occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group >> will need to be eliminated. Although experimental data has shown this case to be extremely rare, >> care must be taken to ensure correctness. >> >> >> **Solution** >> >> Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). >> >> **FYI** >> >> I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. > > src/hotspot/share/opto/superword.cpp line 2449: > >> 2447: Node* n = _block.at(i); >> 2448: int pid = get_pid_or_zero(n); >> 2449: if (pid == 0 || pid <= max_pid_packset) { > > Is `pid == 0 ||` repetitive here? We have `max_pid_packset >= 0`, right? Good catch. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1151657752 From duke at openjdk.org Wed Mar 29 09:45:42 2023 From: duke at openjdk.org (changpeng1997) Date: Wed, 29 Mar 2023 09:45:42 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:32:19 GMT, Andrew Haley wrote: > Do we have any reason to believe that this instruction will ever be matched in a Java application? If not, it's just slowing down compilation for no good reason. Hello. This computing pattern (a ^ (b & (~c))) can be found in some SHA-3 java implementation, like https://github.com/aelstad/keccakj/blob/07185d29fb6c881570e2d7fd2b160460626dc130/src/main/java/com/github/aelstad/keccakj/core/Keccak1600.java#L309. I believe this patch can accelerate some SHA-3 applications implemented by Java. 
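For readers unfamiliar with the pattern, the following is a small, hypothetical Java loop with the same `a ^ (b & ~c)` shape; it is not the Keccak code linked above nor the benchmark from this PR, just an illustration of the kind of loop body whose auto-vectorized form (a vector XOR fed by an AND with a NOT) a rule like the proposed one could fuse into BCAX on SHA3-capable hardware.

```java
// Hypothetical illustration only: a simple loop with the a ^ (b & ~c) shape discussed
// above. It is not the Keccak1600 code or the benchmark used in this PR. Once C2's
// auto-vectorizer turns the body into vector XOR/AND/NOT nodes, a backend rule such as
// the one proposed here could fuse them into a single BCAX instruction.
public class BcaxShape {
    static final int SIZE = 2048;

    static void kernel(long[] r, long[] a, long[] b, long[] c) {
        for (int i = 0; i < SIZE; i++) {
            r[i] = a[i] ^ (b[i] & ~c[i]);   // bit clear (b & ~c), then exclusive-OR
        }
    }

    public static void main(String[] args) {
        long[] r = new long[SIZE], a = new long[SIZE], b = new long[SIZE], c = new long[SIZE];
        for (int i = 0; i < SIZE; i++) { a[i] = i; b[i] = 3L * i; c[i] = 7L * i; }
        for (int iter = 0; iter < 20_000; iter++) {   // warm up so C2 compiles the kernel
            kernel(r, a, b, c);
        }
        System.out.println(r[SIZE - 1]);
    }
}
```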
------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1488276495 From rcastanedalo at openjdk.org Wed Mar 29 10:23:48 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 29 Mar 2023 10:23:48 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v3] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> <0lnz6Gw7u04DbFYQdEvxmotlG8ODG5aEERjeHPaGeTM=.1d754812-f7e7-4822-8395-bea9e099fe9f@github.com> Message-ID: On Wed, 29 Mar 2023 07:33:07 GMT, Roberto Casta?eda Lozano wrote: >> I've tested your update and it looks good! The code details are hard to review but I fully trust Toby's expertise there :-) > >> I've tested your update and it looks good! > > Thanks again, Christian! > looks good to me now! thanks for the work @robcasloz Thanks for reviewing, Toby! I will wait for integration of JDK-8302644 before merging and integrating this one. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1488331283 From tholenstein at openjdk.org Wed Mar 29 10:29:44 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 10:29:44 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v11] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. 
Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: - Merge branch 'JDK-8302644' of github.com:tobiasholenstein/jdk into JDK-8302644 - renamed fo to fileObject, cf to customFilter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/8e821631..2e25ea72 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=09-10 Stats: 17 lines in 1 file changed: 0 ins; 0 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From tholenstein at openjdk.org Wed Mar 29 10:29:52 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 10:29:52 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v9] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:11:12 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> fix: update checkboxes when switching filter profiles >> >> based on feedback from @chhagedorn > > src/utils/IdealGraphVisualizer/FilterWindow/src/main/java/com/sun/hotspot/igv/filterwindow/FilterTopComponent.java line 352: > >> 350: } >> 351: >> 352: public void addFilter(CustomFilter cf) { > > While at it, you could rename `cf` and `fo` to something more descriptive like `customFilter` and `fileObject`. done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151720378 From epeter at openjdk.org Wed Mar 29 10:30:52 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 10:30:52 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies [v2] In-Reply-To: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: > I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). > > Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). > > This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. Example: > > https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 > > This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): > > > 3.7 Scheduling > Dependence analysis before packing ensures that statements within a group can be executed > safely in parallel. However, it may be the case that executing two groups produces a dependence > violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between > groups if a statement in one group is dependent on a statement in the other. As long as there > are no cycles in this dependence graph, all groups can be scheduled such that no violations > occur. 
However, a cycle indicates that the set of chosen groups is invalid and at least one group > will need to be eliminated. Although experimental data has shown this case to be extremely rare, > care must be taken to ensure correctness. > > > **Solution** > > Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). > > **FYI** > > I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: review feedback implemented ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13078/files - new: https://git.openjdk.org/jdk/pull/13078/files/0f1bae6f..adc297e4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13078&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13078&range=00-01 Stats: 82 lines in 3 files changed: 37 ins; 6 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/13078.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13078/head:pull/13078 PR: https://git.openjdk.org/jdk/pull/13078 From chagedorn at openjdk.org Wed Mar 29 10:36:03 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 29 Mar 2023 10:36:03 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 12:49:57 GMT, Emanuel Peter wrote: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. > > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. > `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. > > **My Approach** > I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. > > **What I fixed** > > - `verify_compare` > - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). 
> - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. > - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. > - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. > - `verify_tree` > - I corrected the style and improved comments. > - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. > > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. > - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one. It's great seeing that this flag is finally being revived! I agree with your suggestion to split the work into multiple patches instead of trying to fix all the issues with the flag. The proposed schedule makes sense. Maybe you can rename the bug title to reflect the changes that you tackled in this patch (might not be possible to find such a concise title due to the many changes and disabling individual verification). I also have a few comments. src/hotspot/share/opto/loopnode.cpp line 4655: > 4653: > 4654: // Verify ctrl and idom of every node. > 4655: int fail = 0; Could be a `uint`. src/hotspot/share/opto/loopnode.cpp line 4675: > 4673: Node* n = worklist.at(i); > 4674: // process node > 4675: verify_node(n, loop_verify, fail); Based on Vladimir's comment above, I think it's cleaner to do fails += verify_node(n, loop_verify); instead of using an input/output `int` parameter. But not sure how easy it is to adapt `verify_node()` with the multiple bailouts there. 
src/hotspot/share/opto/loopnode.cpp line 4677: > 4675: verify_node(n, loop_verify, fail); > 4676: // visit inputs > 4677: for(uint j = 0; j < n->req(); j++) { Suggestion: for (uint j = 0; j < n->req(); j++) { src/hotspot/share/opto/loopnode.cpp line 4690: > 4688: // (2) Verify dominator structure (IDOM). > 4689: void PhaseIdealLoop::verify_node(Node* n, const PhaseIdealLoop* loop_verify, int &fail) const { > 4690: uint i = n->_idx; Suggestion: const uint i = n->_idx; src/hotspot/share/opto/loopnode.cpp line 4693: > 4691: // The loop-tree was built from def to use. The verification happens from def to use. > 4692: // We may thus find nodes during verification that are not in the loop-tree. > 4693: if(_nodes[i] == nullptr) { Suggestion: if (_nodes[i] == nullptr) { src/hotspot/share/opto/loopnode.cpp line 4694: > 4692: // We may thus find nodes during verification that are not in the loop-tree. > 4693: if(_nodes[i] == nullptr) { > 4694: assert(loop_verify->_nodes[i] == nullptr, "both should be unreachable"); Could this also be turned into a `tty->print()` + `fail++` instead of a direct assertion? src/hotspot/share/opto/loopnode.cpp line 4698: > 4696: } > 4697: > 4698: // Check everything stored in "_nodes". It might be cleaner to split this method into `verify_node_list()` and `verify_idom()` ------------- PR Review: https://git.openjdk.org/jdk/pull/13207#pullrequestreview-1362661846 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151665187 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151664105 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151665398 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151720685 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151667589 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151676395 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151680984 From chagedorn at openjdk.org Wed Mar 29 10:36:06 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 29 Mar 2023 10:36:06 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: <_kT34J8rFaAwBHPfXTzlnRRI6SaFXR_acj9MJUj6qKg=.fa14ce76-9a6c-46f7-b459-59c097a7c8eb@github.com> References: <_kT34J8rFaAwBHPfXTzlnRRI6SaFXR_acj9MJUj6qKg=.fa14ce76-9a6c-46f7-b459-59c097a7c8eb@github.com> Message-ID: On Wed, 29 Mar 2023 07:45:24 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.cpp line 4656: >> >>> 4654: // Verify ctrl and idom of every node. >>> 4655: int fail = 0; >>> 4656: verify_nodes(C->root(), &loop_verify, fail); >> >> I think result should be returned using local variables instead of storing it in passed variable. > > The nice thing about a variable is that it has an address in memory. Then I can set a watchpoint in the debugger, and go back to it easily, stepping through all the `fail` incrementations. Before this patch, we just used a static variable `fail`. I could also revert to that. It is just not thread-safe. You might just get rid of `fail` and move the `assert(fail == 0)` directly to `verify_nodes()`. Then you can also turn the parameter `fail` into a local variable inside `verify_nodes()`. 
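(For illustration only; this is not the actual loopnode.cpp code.) A self-contained sketch, with stand-in types and simplified signatures, of the calling convention suggested above: the per-node check returns a failure count, the caller keeps a local counter, and a single assert fires at the end instead of threading an in/out fail parameter through:

#include <cassert>
#include <cstdio>
#include <vector>

// Stand-in node type; only meant to show the control flow, not HotSpot's Node.
struct Node { int idx; bool broken; };

// Per-node check: report every problem it finds and return how many there were.
static int verify_node(const Node& n) {
  if (n.broken) {
    std::printf("verification failure at node %d\n", n.idx);
    return 1;
  }
  return 0;
}

// Caller accumulates into a local counter and asserts once at the end.
static void verify_nodes(const std::vector<Node>& nodes) {
  int fails = 0;
  for (const Node& n : nodes) {
    fails += verify_node(n);
  }
  assert(fails == 0 && "loop verification failed");
}

int main() {
  std::vector<Node> nodes = {{0, false}, {1, false}};
  verify_nodes(nodes);  // passes; set a 'broken' flag to make the assert fire
  return 0;
}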
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151662386 From chagedorn at openjdk.org Wed Mar 29 10:36:08 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 29 Mar 2023 10:36:08 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 22:36:30 GMT, Vladimir Kozlov wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. >> >> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. >> >> **The Idea of VerifyLoopOptimizations** >> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. >> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. >> >> **My Approach** >> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. >> >> **What I fixed** >> >> - `verify_compare` >> - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). >> - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. >> - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. >> - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. >> - `verify_tree` >> - I corrected the style and improved comments. >> - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. >> >> **Disabled Verifications** >> I commented out the following verifications: >> >> (A) data nodes should have same ctrl >> (B) ctrl node should belong to same loop >> (C) ctrl node should have same idom >> (D) loop should have same tail >> (E) loop should have same body (list of nodes) >> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong >> >> >> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. >> >> **Follow-Up Work** >> >> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. >> >> I propose the following order: >> >> - idom (C): The dominance structure is at the base of everything else. >> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. 
>> - tail (D): ensure the tail of a loop is updated correctly >> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. >> - other issues like (F) >> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. >> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. >> >> **Testing** >> I am running `tier1-tier6` and stress testing. >> Preliminary results are all good. >> >> **Conclusion** >> With this fix, I have the basic infrastructure of the verification working. >> However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. >> Follow-up RFE's will have to address these one-by-one. > > src/hotspot/share/opto/loopnode.cpp line 4805: > >> 4803: // within the loop tree can be reordered. We attempt to deal with that by >> 4804: // reordering the verify's loop tree if possible. >> 4805: void IdealLoopTree::verify_tree(IdealLoopTree* loop, const IdealLoopTree* parent) const { > > Should we rename `loop` to `loop_verify` similar to `verify_node()` code? I agree with Vladimir but then it's a little bit confusing that `loop_verify` is used for `IdealLoopTree` and `PhaseIdealLoop`. Maybe the variable name for the latter could be renamed into something like `phase_verify` to follow the naming convention used at other places (`loop` for `IdealLoopTree` and `phase` for `PhaseIdealLoop`)? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151683781 From chagedorn at openjdk.org Wed Mar 29 10:36:09 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 29 Mar 2023 10:36:09 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:54:25 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/loopnode.cpp line 4805: >> >>> 4803: // within the loop tree can be reordered. We attempt to deal with that by >>> 4804: // reordering the verify's loop tree if possible. >>> 4805: void IdealLoopTree::verify_tree(IdealLoopTree* loop, const IdealLoopTree* parent) const { >> >> Should we rename `loop` to `loop_verify` similar to `verify_node()` code? > > I agree with Vladimir but then it's a little bit confusing that `loop_verify` is used for `IdealLoopTree` and `PhaseIdealLoop`. Maybe the variable name for the latter could be renamed into something like `phase_verify` to follow the naming convention used at other places (`loop` for `IdealLoopTree` and `phase` for `PhaseIdealLoop`)? Would it make sense to also have a `fail` counter in this method, similar to `verify_nodes()`, to print multiple failures on the go (from this pass and also accumulated from the loop siblings and children)? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1151701084 From tholenstein at openjdk.org Wed Mar 29 10:38:38 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 10:38:38 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v12] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. 
This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. Tobias Holenstein has updated the pull request incrementally with six additional commits since the last revision: - add missing empty line - Merge branch 'JDK-8302644' of github.com:tobiasholenstein/jdk into JDK-8302644 - Update src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java remove empty line Co-authored-by: Christian Hagedorn - init fields directly in RangeSliderModel constructor - move useBoldDisplayName to fields - make group final again ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/2e25ea72..166e6c46 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=10-11 Stats: 17 lines in 3 files changed: 5 ins; 5 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From tholenstein at openjdk.org Wed Mar 29 10:38:42 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 10:38:42 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v9] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:14:00 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> fix: update checkboxes when switching filter profiles >> >> based on feedback from @chhagedorn > > src/utils/IdealGraphVisualizer/Util/src/main/java/com/sun/hotspot/igv/util/RangeSliderModel.java line 49: > >> 47: >> 48: public RangeSliderModel(RangeSliderModel model) { >> 49: this(); > > I suggest to directly initialize the missing fields (`changedEvent` and `colorChangedEvent`) instead of initializing the other fields twice with `this()`. 
done > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java line 379: > >> 377: } >> 378: >> 379: private boolean useBoldDisplayName = false; > > Should be moved up to the other fields done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151731569 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151727317 From tholenstein at openjdk.org Wed Mar 29 10:38:46 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 10:38:46 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v12] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:19:01 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with six additional commits since the last revision: >> >> - add missing empty line >> - Merge branch 'JDK-8302644' of github.com:tobiasholenstein/jdk into JDK-8302644 >> - Update src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java >> >> remove empty line >> >> Co-authored-by: Christian Hagedorn >> - init fields directly in RangeSliderModel constructor >> - move useBoldDisplayName to fields >> - make group final again > > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 163: > >> 161: showNodeHull = true; >> 162: showEmptyBlocks = true; >> 163: group = graph.getGroup(); > > You could keep that to make `group` `final` again and then just initialize it in `init()` (or `initGroup()`) by directly accessing the field. good idea - done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151726052 From tholenstein at openjdk.org Wed Mar 29 10:42:22 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 10:42:22 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v6] In-Reply-To: References: Message-ID: <7apcXR2qV5lzcW3YWZH3jtSA40rRQz_m906BhpqjWeA=.472142c1-7dbe-4e26-814b-04b58514d950@github.com> On Tue, 28 Mar 2023 10:33:18 GMT, Tobias Holenstein wrote: >> src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 326: >> >>> 324: >>> 325: // called when the filter in filterChain changed, but not filterChain itself >>> 326: private void filterChanged() { >> >> After applying this PR, `DiagramViewModel::filterChanged()` is fired every time a new graph in a group is viewed. This is not a functional bug, but it causes the expensive `DiagramViewModel::rebuildDiagram()` to be called twice in that scenario (the other call comes from `DiagramViewModel::changed()`). Would it be possible to arrange the code so that `DiagramViewModel::rebuildDiagram()` is only called once when a new graph in a group is viewed? > > You are right. `DiagramViewModel::rebuildDiagram()` was called too many times when it was not necessary. I updated the code - now `DiagramViewModel::rebuildDiagram()` should be only called when needed my changed caused the checkbox to not be updated anymore when changing the profile (thanks @chhagedorn for finding that bug). Should be fixed now. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1151736953 From tholenstein at openjdk.org Wed Mar 29 10:56:03 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 29 Mar 2023 10:56:03 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v13] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. Tobias Holenstein has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 32 additional commits since the last revision: - Merge remote-tracking branch 'origin/master' into JDK-8302644 - add missing empty line - Merge branch 'JDK-8302644' of github.com:tobiasholenstein/jdk into JDK-8302644 - Update src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java remove empty line Co-authored-by: Christian Hagedorn - init fields directly in RangeSliderModel constructor - move useBoldDisplayName to fields - make group final again - Merge branch 'JDK-8302644' of github.com:tobiasholenstein/jdk into JDK-8302644 - Update src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/FilterChain.java missing newline Co-authored-by: Christian Hagedorn - renamed fo to fileObject, cf to customFilter - ... 
and 22 more: https://git.openjdk.org/jdk/compare/9c4f4464...0ce7ab64 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/166e6c46..0ce7ab64 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=11-12 Stats: 364734 lines in 2896 files changed: 219294 ins; 122274 del; 23166 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From aph at openjdk.org Wed Mar 29 11:10:14 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 29 Mar 2023 11:10:14 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:43:14 GMT, changpeng1997 wrote: > This computing pattern (a ^ (b & (~c))) can be found in some SHA-3 java implementation, like https://github.com/aelstad/keccakj/blob/07185d29fb6c881570e2d7fd2b160460626dc130/src/main/java/com/github/aelstad/keccakj/core/Keccak1600.java#L309. > > I believe this patch can accelerate some SHA-3 applications implemented by Java. OK, thanks. Please benchmark this and let us know the result. It'd also be interesting to know how it compares with our SHA-3 intrinsic. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1488393182 From thartmann at openjdk.org Wed Mar 29 12:35:42 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 29 Mar 2023 12:35:42 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v13] In-Reply-To: References: Message-ID: <4ogpCo4lpZYsHzCjBH_c6mpRM0X8J4uaIqwaGvZ5RlA=.9d3dd44c-ccfb-4d31-913f-220980e01e2b@github.com> On Wed, 29 Mar 2023 10:56:03 GMT, Tobias Holenstein wrote: >> In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). >> >> - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile >> - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. >> >> ### Global profile >> Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. >> >> ### Local profile >> Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. >> tabA >> >> When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. >> tabB >> >> ### New profile >> The user can also create a new filter profile and give it a name. E.g. `My Filters` >> newProfile >> >> The `My Filters` profile is then globally available to other tabs as well >> selectProfile >> >> >> ### Filters for cloned tabs >> When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened >> cloneTab >> >> ### Saving of filters and profiles >> When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. 
The profile that was last used is selected when opening IGV. > > Tobias Holenstein has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 32 additional commits since the last revision: > > - Merge remote-tracking branch 'origin/master' into JDK-8302644 > - add missing empty line > - Merge branch 'JDK-8302644' of github.com:tobiasholenstein/jdk into JDK-8302644 > - Update src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java > > remove empty line > > Co-authored-by: Christian Hagedorn > - init fields directly in RangeSliderModel constructor > - move useBoldDisplayName to fields > - make group final again > - Merge branch 'JDK-8302644' of github.com:tobiasholenstein/jdk into JDK-8302644 > - Update src/utils/IdealGraphVisualizer/Filter/src/main/java/com/sun/hotspot/igv/filter/FilterChain.java > > missing newline > > Co-authored-by: Christian Hagedorn > - renamed fo to fileObject, cf to customFilter > - ... and 22 more: https://git.openjdk.org/jdk/compare/603f9b64...0ce7ab64 Works well for me but I spotted the following issue which seems to be a regression from this change: - Open two .xml files, open a graph in each and select the local profile - Double click on a filter and then click Cancel java.lang.AssertionError at com.sun.hotspot.igv.view.DiagramViewModel.filterChanged(DiagramViewModel.java:356) at com.sun.hotspot.igv.view.DiagramViewModel.lambda$new$1(DiagramViewModel.java:73) at com.sun.hotspot.igv.data.ChangedEvent.fire(ChangedEvent.java:44) at com.sun.hotspot.igv.data.ChangedEvent.fire(ChangedEvent.java:31) at com.sun.hotspot.igv.data.Event.fire(Event.java:56) at com.sun.hotspot.igv.filter.FilterChain$1.changed(FilterChain.java:48) at com.sun.hotspot.igv.filter.FilterChain$1.changed(FilterChain.java:45) at com.sun.hotspot.igv.data.ChangedEvent.fire(ChangedEvent.java:44) at com.sun.hotspot.igv.data.ChangedEvent.fire(ChangedEvent.java:31) at com.sun.hotspot.igv.data.Event.fire(Event.java:56) at com.sun.hotspot.igv.filter.CustomFilter.openInEditor(CustomFilter.java:86) at com.sun.hotspot.igv.filter.CustomFilter$1.open(CustomFilter.java:77) at org.openide.actions.OpenAction.performAction(OpenAction.java:59) at org.openide.util.actions.NodeAction$DelegateAction$1.run(NodeAction.java:561) at org.openide.util.actions.ActionInvoker$1.run(ActionInvoker.java:70) at org.openide.util.actions.ActionInvoker.doPerformAction(ActionInvoker.java:91) at org.openide.util.actions.ActionInvoker.invokeAction(ActionInvoker.java:74) at org.openide.util.actions.NodeAction$DelegateAction.actionPerformed(NodeAction.java:558) at org.openide.explorer.view.ListView.performObjectAt(ListView.java:681) at org.openide.explorer.view.ListView$PopupSupport.mouseClicked(ListView.java:1306) at java.desktop/java.awt.AWTEventMulticaster.mouseClicked(AWTEventMulticaster.java:278) at java.desktop/java.awt.Component.processMouseEvent(Component.java:6638) at java.desktop/javax.swing.JComponent.processMouseEvent(JComponent.java:3342) at java.desktop/java.awt.Component.processEvent(Component.java:6400) at java.desktop/java.awt.Container.processEvent(Container.java:2263) at java.desktop/java.awt.Component.dispatchEventImpl(Component.java:5011) at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2321) at java.desktop/java.awt.Component.dispatchEvent(Component.java:4843) at 
java.desktop/java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4918) at java.desktop/java.awt.LightweightDispatcher.processMouseEvent(Container.java:4556) at java.desktop/java.awt.LightweightDispatcher.dispatchEvent(Container.java:4488) at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2307) at java.desktop/java.awt.Window.dispatchEventImpl(Window.java:2772) at java.desktop/java.awt.Component.dispatchEvent(Component.java:4843) at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:772) at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721) at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715) at java.base/java.security.AccessController.doPrivileged(Native Method) at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:95) at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:745) at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:743) at java.base/java.security.AccessController.doPrivileged(Native Method) at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:742) at org.netbeans.core.TimableEventQueue.dispatchEvent(TimableEventQueue.java:136) [catch] at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203) at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124) at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113) at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109) at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101) at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90) ------------- PR Comment: https://git.openjdk.org/jdk/pull/12714#issuecomment-1488516728 From vkempik at openjdk.org Wed Mar 29 12:48:29 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Wed, 29 Mar 2023 12:48:29 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled Message-ID: Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. ------------- Commit messages: - Fix typo - change long to ulong in type convertion - Fix includes - 8305056: Avoid unaligned access in emit_intX methods if not enabled Changes: https://git.openjdk.org/jdk/pull/13227/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305056 Stats: 23 lines in 5 files changed: 4 ins; 0 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/13227.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13227/head:pull/13227 PR: https://git.openjdk.org/jdk/pull/13227 From epeter at openjdk.org Wed Mar 29 13:06:13 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 13:06:13 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 [v2] In-Reply-To: References: Message-ID: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. 
> > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. > `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. > > **My Approach** > I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. > > **What I fixed** > > - `verify_compare` > - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). > - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. > - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. > - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. > - `verify_tree` > - I corrected the style and improved comments. > - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. > > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. > - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. 
> However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix after Vladimir's suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13207/files - new: https://git.openjdk.org/jdk/pull/13207/files/68318248..6cd3479b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=00-01 Stats: 98 lines in 2 files changed: 55 ins; 20 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/13207.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13207/head:pull/13207 PR: https://git.openjdk.org/jdk/pull/13207 From epeter at openjdk.org Wed Mar 29 13:51:07 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 13:51:07 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 [v3] In-Reply-To: References: Message-ID: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. > > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. > `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. > > **My Approach** > I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. > > **What I fixed** > > - `verify_compare` > - Renamed it to `verify_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). > - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. > - The BFS calls `verify_node` on every node. I refactored `verify_node` a bit, so that it is more readable. > - Rather than having a thread-unsafe static variable `fail`, I now made it a reference argument. > - `verify_tree` > - I corrected the style and improved comments. > - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. 
> > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. > - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix after Christian's suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13207/files - new: https://git.openjdk.org/jdk/pull/13207/files/6cd3479b..5ef6113f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=01-02 Stats: 124 lines in 2 files changed: 55 ins; 37 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/13207.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13207/head:pull/13207 PR: https://git.openjdk.org/jdk/pull/13207 From epeter at openjdk.org Wed Mar 29 13:56:30 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 13:56:30 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 [v3] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 13:51:07 GMT, Emanuel Peter wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. >> >> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. >> >> **The Idea of VerifyLoopOptimizations** >> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. >> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. 
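(Editorial sketch, not HotSpot code.) The description above mentions flattening the recursive BFS in verify_nodes into a loop to avoid stack overflow; with stand-in types, the usual shape of such a worklist traversal looks roughly like this:

#include <cstddef>
#include <vector>

// Stand-ins for the graph node and the per-node verification hook.
struct Node {
  std::vector<Node*> inputs;
  bool visited = false;
};

static void verify_one(Node*) { /* per-node checks would go here */ }

// Push the root, then walk the worklist by index; discovered inputs are
// appended, so the traversal needs no recursion and no deep native stack.
static void verify_all(Node* root) {
  std::vector<Node*> worklist;
  root->visited = true;
  worklist.push_back(root);
  for (std::size_t i = 0; i < worklist.size(); i++) {
    Node* n = worklist[i];
    verify_one(n);
    for (Node* in : n->inputs) {
      if (in != nullptr && !in->visited) {
        in->visited = true;
        worklist.push_back(in);
      }
    }
  }
}

int main() {
  Node a, b;
  a.inputs.push_back(&b);
  verify_all(&a);
  return 0;
}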
After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. >> >> **My Approach** >> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. >> >> **What I fixed** >> >> - `verify_compare` >> - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). >> - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. >> - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable. >> - I now report all failures, before asserting. >> - `verify_tree` >> - I corrected the style and improved comments. >> - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. >> - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`. >> >> **Disabled Verifications** >> I commented out the following verifications: >> >> (A) data nodes should have same ctrl >> (B) ctrl node should belong to same loop >> (C) ctrl node should have same idom >> (D) loop should have same tail >> (E) loop should have same body (list of nodes) >> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong >> >> >> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. >> >> **Follow-Up Work** >> >> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. >> >> I propose the following order: >> >> - idom (C): The dominance structure is at the base of everything else. >> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. >> - tail (D): ensure the tail of a loop is updated correctly >> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. >> - other issues like (F) >> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. >> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. >> >> **Testing** >> I am running `tier1-tier6` and stress testing. >> Preliminary results are all good. >> >> **Conclusion** >> With this fix, I have the basic infrastructure of the verification working. >> However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. >> Follow-up RFE's will have to address these one-by-one. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fix after Christian's suggestions Update: I removed the `fail` variable completely. Now the verification methods simply return `true` if they found a failure. 
It is propagated up, all the way to `verify()`, where we assert. This means we will now report all failures. I think this is desirable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13207#issuecomment-1488662211 From ngasson at openjdk.org Wed Mar 29 14:03:42 2023 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 29 Mar 2023 14:03:42 GMT Subject: RFR: 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE [v2] In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 09:46:19 GMT, Bhavana Kilambi wrote: >> The cast operation for VectorMask from wider type to narrow type returns incorrect result for trueCount() method invocation for the resultant mask with SVE (on some SVE machines toLong() also results in incorrect values). An example narrow operation which results in incorrect toLong() and trueCount() values is shown below for a 128-bit -> 64-bit conversion and this can be extended to other narrow operations where the source mask in bytes is either 4x or 8x the size of the result mask in bytes - >> >> >> public class TestMaskCast { >> >> static final boolean [] mask_arr = {true, true, false, true}; >> >> public static long narrow_long() { >> VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); >> return lmask128.cast(IntVector.SPECIES_64).toLong(); >> } >> >> public static void main(String[] args) { >> long r = 0L; >> for (int ic = 0; ic < 50000; ic++) { >> r = narrow_long(); >> } >> System.out.println("toLong() : " + r); >> } >> } >> >> >> **C2 compilation result :** >> java --add-modules jdk.incubator.vector TestMaskCast >> toLong(): 15 >> >> **Interpreter result (for verification) :** >> java --add-modules jdk.incubator.vector -Xint TestMaskCast >> toLong(): 3 >> >> The incorrect results with toLong() have been observed only on the 128-bit and 256-bit SVE machines but they are not reproducible on a 512-bit machine. However, trueCount() returns incorrect values too and they are reproducible on all the SVE machines and thus is more reliable to use trueCount() to bring out the drawbacks of the current implementation of mask cast narrow operation for SVE. >> >> Replacing the call to toLong() by trueCount() in the above example - >> >> >> public class TestMaskCast { >> >> static final boolean [] mask_arr = {true, true, false, true}; >> >> public static int narrow_long() { >> VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); >> return lmask128.cast(IntVector.SPECIES_64).trueCount(); >> } >> >> public static void main(String[] args) { >> int r = 0; >> for (int ic = 0; ic < 50000; ic++) { >> r = narrow_long(); >> } >> System.out.println("trueCount() : " + r); >> } >> } >> >> >> >> **C2 compilation result:** >> java --add-modules jdk.incubator.vector TestMaskCast >> trueCount() : 4 >> >> **Interpreter result:** >> java --add-modules jdk.incubator.vector -Xint TestMaskCast >> trueCount() : 2 >> >> Since in this example, the source mask size in bytes is 2x that of the result mask, trueCount() returns 2x the number of true elements in the source mask. It would return 4x/8x the number of true elements in the source mask if the size of the source mask is 4x/8x that of result mask. >> >> The returned values are incorrect because of the higher order bits in the result not being cleared (since the result is narrowed down) and trueCount() or toLong() tend to consider the higher order bits in the vector register as well which results in incorrect value. 
For the 128-bit to 64-bit conversion with a mask - "TT" passed, the current implementation for mask cast narrow operation returns the same mask in the lower and upper half of the 128-bit register that is - "TTTT" which results in a long value of 15 (instead of 3 - "FFTT" for the 64-bit Integer mask) and number of true elements to be 4 (instead of 2). >> >> This patch proposes a fix for this problem. An already existing JTREG IR test - "test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java" has also been modified to call the trueCount() method as well since the toString() method alone cannot be used to reproduce the incorrect values in this bug. This test passes successfully on 128-bit, 256-bit and 512-bit SVE machines. Since the IR test has been changed, it has been tested successfully on other platforms like x86 and aarch64 Neon machines as well to ensure the changes have not introduced any new errors. > > Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge master > - 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE > > The cast operation for VectorMask from wider type to narrow type returns > incorrect result for trueCount() method invocation for the resultant > mask with SVE (on some SVE machines toLong() also results in incorrect > values). An example narrow operation which results in incorrect toLong() > and trueCount() values is shown below for a 128-bit -> 64-bit conversion > and this can be extended to other narrow operations where the source > mask in bytes is either 4x or 8x the size of the result mask in > bytes - > > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static long narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).toLong(); > } > > public static void main(String[] args) { > long r = 0L; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("toLong() : " + r); > } > } > > C2 compilation result : > java --add-modules jdk.incubator.vector TestMaskCast > toLong(): 15 > > Interpreter result (for verification) : > java --add-modules jdk.incubator.vector -Xint TestMaskCast > toLong(): 3 > > The incorrect results with toLong() have been observed only on the > 128-bit and 256-bit SVE machines but they are not reproducible on a > 512-bit machine. However, trueCount() returns incorrect values too > and they are reproducible on all the SVE machines and thus is more > reliable to use trueCount() to bring out the drawbacks of the current > implementation of mask cast narrow operation for SVE. 
> > Replacing the call to toLong() by trueCount() in the above example - > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static int narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).trueCount(); > } > > public static void main(String[] args) { > int r = 0; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("trueCount() : " + r); > } > } > > C2 compilation result: > java --add-modules jdk.incubator.vector TestMaskCast > trueCount() : 4 > > Interpreter result: > java --add-modules jdk.incubator.vector -Xint TestMaskCast > trueCount() : 2 > > Since in this example, the source mask size in bytes is 2x that of the > result mask, trueCount() returns 2x the number of true elements in the > source mask. It would return 4x/8x the number of true elements in the > source mask if the size of the source mask is 4x/8x that of result mask. > > The returned values are incorrect because of the higher order bits in > the result not being cleared (since the result is narrowed down) and > trueCount() or toLong() tend to consider the higher order bits in the > vector register as well which results in incorrect value. > For the 128-bit to 64-bit conversion with a mask - "TT" passed, the > current implementation for mask cast narrow operation returns the same > mask in the lower and upper half of the 128-bit register that is - > "TTTT" which results in a long value of 15 (instead of 3 - "FFTT" for > the 64-bit Integer mask) and number of true elements to be 4 (instead of > 2). > > This patch proposes a fix for this problem. An already existing JTREG IR > test - "test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java" > has also been modified to call the trueCount() method as well since the > toString() method alone cannot be used to reproduce the incorrect values > in this bug. This test passes successfully on 128-bit, 256-bit and > 512-bit SVE machines. Since the IR test has been changed, it has been > tested successfully on other platforms like x86 and aarch64 Neon > machines as well to ensure the changes have not introduced any new > errors. Marked as reviewed by ngasson (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/12901#pullrequestreview-1363183016 From epeter at openjdk.org Wed Mar 29 14:22:54 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 29 Mar 2023 14:22:54 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 [v4] In-Reply-To: References: Message-ID: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. > > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. > `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. 
> > **My Approach** > I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. > > **What I fixed** > > - `verify_compare` > - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). > - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. > - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable. > - I now report all failures, before asserting. > - `verify_tree` > - I corrected the style and improved comments. > - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. > - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`. > > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. > - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one. 
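The "report all failures, before asserting" approach described above can be pictured with a small stand-alone Java sketch. It is only an illustration of the compare-and-report idea; the real verification is C++ code in loopnode.cpp, and the names below are invented:

import java.util.HashMap;
import java.util.Map;

public class VerifyIdomSketch {
    // Invented stand-ins: the mapping maintained incrementally during loop-opts
    // and the mapping recomputed from scratch for verification.
    static Map<Integer, Integer> maintainedIdom = new HashMap<>();
    static Map<Integer, Integer> recomputedIdom = new HashMap<>();

    static void verifyIdom() {
        int failures = 0;
        for (Map.Entry<Integer, Integer> e : maintainedIdom.entrySet()) {
            Integer expected = recomputedIdom.get(e.getKey());
            if (!e.getValue().equals(expected)) {
                // Report every mismatch first, so a single run shows all of them.
                System.err.println("idom mismatch for node " + e.getKey()
                        + ": maintained=" + e.getValue() + " recomputed=" + expected);
                failures++;
            }
        }
        if (failures > 0) {
            // Assert only after all mismatches have been reported.
            throw new AssertionError(failures + " idom mismatches");
        }
    }

    public static void main(String[] args) {
        maintainedIdom.put(1, 0); recomputedIdom.put(1, 0);
        maintainedIdom.put(2, 1); recomputedIdom.put(2, 0); // deliberate mismatch
        verifyIdom();
    }
}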
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Restrict VerifyLoopOptimizations to ASSERT / DEBUG_ONLY ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13207/files - new: https://git.openjdk.org/jdk/pull/13207/files/5ef6113f..fe56c534 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=02-03 Stats: 26 lines in 4 files changed: 12 ins; 9 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/13207.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13207/head:pull/13207 PR: https://git.openjdk.org/jdk/pull/13207 From xliu at openjdk.org Wed Mar 29 15:49:12 2023 From: xliu at openjdk.org (Xin Liu) Date: Wed, 29 Mar 2023 15:49:12 GMT Subject: RFR: 8305142: Can't bootstrap ctw.jar In-Reply-To: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> References: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> Message-ID: <0b9rVtkCzkuy0fnpbIWGGh39kpSKrn7YT3spVip83qY=.51fddb49-e43a-443a-ba75-628e15a30cf3@github.com> On Wed, 29 Mar 2023 07:17:23 GMT, Xin Liu wrote: > This patch add a few add-exports so CTW can access those internal packages of java.base module. > > make succeeds and ctw.jar is generated as expected. Yes, those are all compile-time errors. I don't think CTW itself uses them at runtime. CtwRunner.java is yet another driver. I verified that using the following script. ? jdk git:(JDK-8305142) cat ctw.sh #!/bin/bash DIST=test/hotspot/jtreg/testlibrary/ctw/dist java \ -Dtest.jdk=${JAVA_HOME} \ -Dtest.vm.opts="-Xbootclasspath/a:${DIST}/wb.jar" \ -XX:-UseCounterDecay -Xbatch "-XX:CompileCommand=exclude,java/lang/invoke/MethodHandle.*" -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xbootclasspath/a:${DIST}/wb.jar \ -cp $DIST/ctw.jar sun.hotspot.tools.ctw.CtwRunner $@ It works fine. My conclusion is it doesn't need those symbols at runtime. sh ./ctw.sh modules:java.base 0 100 CompileCommand: exclude java/lang/invoke/MethodHandle.* bool exclude = true Compiling 100 classes (of 7396 total classes) starting at 0 and ending at 100 For random generator using seed: 6656142999236136158 To re-run test with same seed value please add "-Djdk.test.lib.random.seed=6656142999236136158" to command line.
Command line: [/local/home/xxinliu/Devel/jdk/build/linux-x86_64-server-release/images/jdk/bin/java -cp test/hotspot/jtreg/testlibrary/ctw/dist/ctw.jar -Xbootclasspath/a:test/hotspot/jtreg/testlibrary/ctw/dist/wb.jar @/local/home/xxinliu/Devel/jdk/modules_java_base_0.cmd ] modules_java_base_0 1715114783ms START : [/local/home/xxinliu/Devel/jdk/build/linux-x86_64-server-release/images/jdk/bin/java -cp test/hotspot/jtreg/testlibrary/ctw/dist/ctw.jar -Xbootclasspath/a:test/hotspot/jtreg/testlibrary/ctw/dist/wb.jar @/local/home/xxinliu/Devel/jdk/modules_java_base_0.cmd] cout/cerr are redirected to modules_java_base_0 modules_java_base_0 1715117836ms END : exit code = 0 Executed CTW for all 100 classes in modules:java.base(at /local/home/xxinliu/Devel/jdk/build/linux-x86_64-server-release/images/jdk/lib/modules) ------------- PR Comment: https://git.openjdk.org/jdk/pull/13220#issuecomment-1488871358 From shade at openjdk.org Wed Mar 29 15:53:46 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 29 Mar 2023 15:53:46 GMT Subject: RFR: 8305142: Can't bootstrap ctw.jar In-Reply-To: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> References: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> Message-ID: On Wed, 29 Mar 2023 07:17:23 GMT, Xin Liu wrote: > This patch add a few add-exports so CTW can access those internal packages of java.base module. > > make succeeds and ctw.jar is generated as expected. Okay then. Looks good! ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13220#pullrequestreview-1363449470 From bkilambi at openjdk.org Wed Mar 29 16:16:40 2023 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 29 Mar 2023 16:16:40 GMT Subject: Integrated: 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 10:56:40 GMT, Bhavana Kilambi wrote: > The cast operation for VectorMask from wider type to narrow type returns incorrect result for trueCount() method invocation for the resultant mask with SVE (on some SVE machines toLong() also results in incorrect values). An example narrow operation which results in incorrect toLong() and trueCount() values is shown below for a 128-bit -> 64-bit conversion and this can be extended to other narrow operations where the source mask in bytes is either 4x or 8x the size of the result mask in bytes - > > > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static long narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).toLong(); > } > > public static void main(String[] args) { > long r = 0L; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("toLong() : " + r); > } > } > > > **C2 compilation result :** > java --add-modules jdk.incubator.vector TestMaskCast > toLong(): 15 > > **Interpreter result (for verification) :** > java --add-modules jdk.incubator.vector -Xint TestMaskCast > toLong(): 3 > > The incorrect results with toLong() have been observed only on the 128-bit and 256-bit SVE machines but they are not reproducible on a 512-bit machine. 
However, trueCount() returns incorrect values too and they are reproducible on all the SVE machines and thus is more reliable to use trueCount() to bring out the drawbacks of the current implementation of mask cast narrow operation for SVE. > > Replacing the call to toLong() by trueCount() in the above example - > > > public class TestMaskCast { > > static final boolean [] mask_arr = {true, true, false, true}; > > public static int narrow_long() { > VectorMask lmask128 = VectorMask.fromArray(LongVector.SPECIES_128, mask_arr, 0); > return lmask128.cast(IntVector.SPECIES_64).trueCount(); > } > > public static void main(String[] args) { > int r = 0; > for (int ic = 0; ic < 50000; ic++) { > r = narrow_long(); > } > System.out.println("trueCount() : " + r); > } > } > > > > **C2 compilation result:** > java --add-modules jdk.incubator.vector TestMaskCast > trueCount() : 4 > > **Interpreter result:** > java --add-modules jdk.incubator.vector -Xint TestMaskCast > trueCount() : 2 > > Since in this example, the source mask size in bytes is 2x that of the result mask, trueCount() returns 2x the number of true elements in the source mask. It would return 4x/8x the number of true elements in the source mask if the size of the source mask is 4x/8x that of result mask. > > The returned values are incorrect because of the higher order bits in the result not being cleared (since the result is narrowed down) and trueCount() or toLong() tend to consider the higher order bits in the vector register as well which results in incorrect value. For the 128-bit to 64-bit conversion with a mask - "TT" passed, the current implementation for mask cast narrow operation returns the same mask in the lower and upper half of the 128-bit register that is - "TTTT" which results in a long value of 15 (instead of 3 - "FFTT" for the 64-bit Integer mask) and number of true elements to be 4 (instead of 2). > > This patch proposes a fix for this problem. An already existing JTREG IR test - "test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java" has also been modified to call the trueCount() method as well since the toString() method alone cannot be used to reproduce the incorrect values in this bug. This test passes successfully on 128-bit, 256-bit and 512-bit SVE machines. Since the IR test has been changed, it has been tested successfully on other platforms like x86 and aarch64 Neon machines as well to ensure the changes have not introduced any new errors. This pull request has now been integrated. Changeset: 67274906 Author: Bhavana Kilambi Committer: Nick Gasson URL: https://git.openjdk.org/jdk/commit/67274906aeb7a6b83761e6aaf85688aa61aa8a20 Stats: 589 lines in 5 files changed: 449 ins; 0 del; 140 mod 8303161: [vectorapi] VectorMask.cast narrow operation returns incorrect value with SVE Reviewed-by: eliu, xgong, ngasson ------------- PR: https://git.openjdk.org/jdk/pull/12901 From vkempik at openjdk.org Wed Mar 29 16:32:10 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Wed, 29 Mar 2023 16:32:10 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v2] In-Reply-To: References: Message-ID: > Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. > > Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. 
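The idea behind the change quoted above is to emit a multi-byte immediate one byte at a time when the target does not allow (or penalizes) unaligned stores, and to keep the single wide store where unaligned access is enabled. A rough stand-alone Java illustration of the two shapes follows; it is only an analogy (a Java ByteBuffer access is always legal at an unaligned index, unlike a raw store issued by the assembler), and the buffer and field names are invented:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EmitIntSketch {
    // Invented stand-ins for the assembler's code buffer and its end offset.
    static byte[] code = new byte[64];
    static int end = 1; // deliberately not 4-byte aligned

    // Byte-wise emit: the safe shape for targets that forbid unaligned stores.
    static void emitInt32Bytewise(int v) {
        code[end]     = (byte) v;
        code[end + 1] = (byte) (v >>> 8);
        code[end + 2] = (byte) (v >>> 16);
        code[end + 3] = (byte) (v >>> 24);
        end += 4;
    }

    // Single wide store: only appropriate where unaligned accesses are cheap.
    static void emitInt32Wide(int v) {
        ByteBuffer.wrap(code).order(ByteOrder.LITTLE_ENDIAN).putInt(end, v);
        end += 4;
    }

    public static void main(String[] args) {
        emitInt32Bytewise(0xDEADBEEF);
        emitInt32Wide(0x12345678);
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 8; i++) {
            sb.append(String.format("%02x ", code[i]));
        }
        System.out.println(sb); // ef be ad de 78 56 34 12
    }
}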
Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: Fix 32-bit archs ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13227/files - new: https://git.openjdk.org/jdk/pull/13227/files/3300c8ab..12ebef47 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=00-01 Stats: 7 lines in 1 file changed: 6 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13227.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13227/head:pull/13227 PR: https://git.openjdk.org/jdk/pull/13227 From vkempik at openjdk.org Wed Mar 29 16:36:17 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Wed, 29 Mar 2023 16:36:17 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v3] In-Reply-To: References: Message-ID: > Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. > > Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: Reduce code duplication ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13227/files - new: https://git.openjdk.org/jdk/pull/13227/files/12ebef47..ffa4edd3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=01-02 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13227.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13227/head:pull/13227 PR: https://git.openjdk.org/jdk/pull/13227 From kvn at openjdk.org Wed Mar 29 16:41:34 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Mar 2023 16:41:34 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 08:33:16 GMT, Roberto Casta?eda Lozano wrote: > > I had testing run, for tier1-tier6 and stress testing. > > The non-verification run finished with 27d 3h machine time. > > The verification run is still running, with at least 27d 7h machine time. > > Thanks for the data! If the verification run does not take much longer (say <1% on top of what the non-verification run takes), it might be a good trade-off to have it enabled by default. Not just to prevent the verification code from rotting but to actually get more value from it (better chances to find bugs earlier). I assume @eme64 tested it with current limited verification. With adding/restoring more code the time will increase. I suggest to enable it only for `stress` testing now so we always use it for pre-integration testing and later tiers. After enabling of all verification code we will check time again and can decide if we can enable it by default always. So after pushing this fix we should add it to stress testing - we need that for pre-integration testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13207#issuecomment-1488938514 From qamai at openjdk.org Wed Mar 29 17:07:30 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 29 Mar 2023 17:07:30 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:12:59 GMT, changpeng1997 wrote: > We can use BCAX [1] [2] to merge a bit clear and an exclusive-OR operation. 
For example, on a 128-bit aarch64 machine which supports NEON and SHA3, following instruction sequence: > > > ... > bic v16.16b, v16.16b, v17.16b > eor v16.16b, v16.16b, v18.16b > ... > > > can be optimized to: > > > ... > bcax v16.16b, v17.16b, v16.16b, v18.16b > ... > > > This patch adds backend rules for BCAX, and we can gain almost 10% performance lift on a 128-bit aarch64 machine which supports NEON and SHA3. Similar performance uplift can also be observed on SVE2. > > Performance_Before: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 9779.361 47.184 > TestInt#size(2048) 3028.617 7.292 > TestLong#size(2048) 1331.216 1.815 > TestShort#size(2048) 5828.089 8.975 > > > Performance_BCAX_NEON: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 10510.371 34.931 > TestInt#size(2048) 3437.512 81.318 > TestLong#size(2048) 1461.023 0.679 > TestShort#size(2048) 6238.210 26.452 > > > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/BCAX--Bit-Clear-and-XOR- > [2]: https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/BCAX--Bitwise-clear-and-exclusive-OR-?lang=en We have `MacroLogicVNode` in the middle-end, can it be used to leverage the ternary logical instructions of SVE2 by selectively creating the node if the truth table is supported? Thanks a lot. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1488978800 From kvn at openjdk.org Wed Mar 29 17:18:41 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Mar 2023 17:18:41 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v4] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 14:22:54 GMT, Emanuel Peter wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. >> >> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. >> >> **The Idea of VerifyLoopOptimizations** >> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. >> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. >> >> **My Approach** >> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. >> >> **What I fixed** >> >> - `verify_compare` >> - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). >> - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. >> - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable. 
>> - I now report all failures, before asserting. >> - `verify_tree` >> - I corrected the style and improved comments. >> - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. >> - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`. >> >> **Disabled Verifications** >> I commented out the following verifications: >> >> (A) data nodes should have same ctrl >> (B) ctrl node should belong to same loop >> (C) ctrl node should have same idom >> (D) loop should have same tail >> (E) loop should have same body (list of nodes) >> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong >> >> >> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. >> >> **Follow-Up Work** >> >> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. >> >> I propose the following order: >> >> - idom (C): The dominance structure is at the base of everything else. >> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. >> - tail (D): ensure the tail of a loop is updated correctly >> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. >> - other issues like (F) >> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. >> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. >> >> **Testing** >> I am running `tier1-tier6` and stress testing. >> Preliminary results are all good. >> >> **Conclusion** >> With this fix, I have the basic infrastructure of the verification working. >> However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. >> Follow-up RFE's will have to address these one-by-one. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Restrict VerifyLoopOptimizations to ASSERT / DEBUG_ONLY I like this much better. Very nice `verify_tree()` code now. I have few comments. src/hotspot/share/opto/loopnode.cpp line 4862: > 4860: child_verify = children_verify.at(j); > 4861: } > 4862: if (child != nullptr && child_verify != nullptr && child->_head != child_verify->_head) { May be have sanity assert before this line that we can't have both values equal to `nullptr`. src/hotspot/share/opto/loopnode.cpp line 4863: > 4861: } > 4862: if (child != nullptr && child_verify != nullptr && child->_head != child_verify->_head) { > 4863: assert(child->_head->_idx != child_verify->_head->_idx, "is implied"); Why you need this assert? It duplicate the check you already have. src/hotspot/share/opto/loopnode.cpp line 4885: > 4883: // Irreducible loops can pick a different header (one of its entries). > 4884: } else if (child_verify->_head->as_Region()->is_in_infinite_subgraph()) { > 4885: // Infinite loops do not get attached to the loop-tree on their first visit. Can you explain why you don't check `child` for infinite subgraph? 
------------- PR Review: https://git.openjdk.org/jdk/pull/13207#pullrequestreview-1363579247 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1152259422 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1152251954 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1152243021 From kvn at openjdk.org Wed Mar 29 17:22:50 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Mar 2023 17:22:50 GMT Subject: RFR: 8302814: Delete unused CountLoopEnd instruct with CmpX [v6] In-Reply-To: References: <4tL_UFrMJLavPFPa_Q_O3himxRjM_sWRADfAXsMgZYk=.2ba5fdc5-0470-4bef-926b-2e40d0ca31e7@github.com> Message-ID: <7BrXLGLJbwD-IsmS_J3Q489m-q5xwGchQ4QYOM3XgHg=.0e0b31d0-c166-40ec-be26-3e3a940f93f5@github.com> On Wed, 29 Mar 2023 01:19:43 GMT, SUN Guoyun wrote: >> CountLoopEnd only for T_int, therefore the following instructs in riscv.ad are useless and should be deleted. >> >> CountedLoopEnd cmp (CmpU op1 op2) >> CountedLoopEnd cmp (CmpP op1 op2) >> CountedLoopEnd cmp (CmpN op1 op2) >> CountedLoopEnd cmp (CmpF op1 op2) >> CountedLoopEnd cmp (CmpD op1 op2) >> >> and CountedLoopEnd with CmpU on x86*.ad, aarch64.ad ar useless also. >> >> Please help review it. >> >> Thanks. > > SUN Guoyun has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - Merge branch 'openjdk:master' into 8302814 > - 8302814: Delete unused CountLoopEnd instruct with CmpX > - 8302814: Delete unused CountLoopEnd instruct with CmpX > - 8302814: Delete unused CountLoopEnd instruct with CmpX > - Merge branch 'openjdk:master' into 8302814 > - Merge branch 'openjdk:master' into 8302814 > - 8302814: Delete unused CountLoopEnd instruct with CmpX My testing (tier1-4, xcomp, stress) passed with latest version. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12648#issuecomment-1489000893 From duke at openjdk.org Wed Mar 29 17:22:51 2023 From: duke at openjdk.org (SUN Guoyun) Date: Wed, 29 Mar 2023 17:22:51 GMT Subject: Integrated: 8302814: Delete unused CountLoopEnd instruct with CmpX In-Reply-To: <4tL_UFrMJLavPFPa_Q_O3himxRjM_sWRADfAXsMgZYk=.2ba5fdc5-0470-4bef-926b-2e40d0ca31e7@github.com> References: <4tL_UFrMJLavPFPa_Q_O3himxRjM_sWRADfAXsMgZYk=.2ba5fdc5-0470-4bef-926b-2e40d0ca31e7@github.com> Message-ID: On Mon, 20 Feb 2023 07:34:08 GMT, SUN Guoyun wrote: > CountLoopEnd only for T_int, therefore the following instructs in riscv.ad are useless and should be deleted. > > CountedLoopEnd cmp (CmpU op1 op2) > CountedLoopEnd cmp (CmpP op1 op2) > CountedLoopEnd cmp (CmpN op1 op2) > CountedLoopEnd cmp (CmpF op1 op2) > CountedLoopEnd cmp (CmpD op1 op2) > > and CountedLoopEnd with CmpU on x86*.ad, aarch64.ad ar useless also. > > Please help review it. > > Thanks. This pull request has now been integrated. 
Changeset: be764a71 Author: SUN Guoyun <40024232+sunny868 at users.noreply.github.com> Committer: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/be764a711c1bf489f54d5bdc8e5e3b1891ea13cd Stats: 546 lines in 4 files changed: 0 ins; 546 del; 0 mod 8302814: Delete unused CountLoopEnd instruct with CmpX Reviewed-by: kvn, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/12648 From kvn at openjdk.org Wed Mar 29 17:32:48 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Mar 2023 17:32:48 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v15] In-Reply-To: References: Message-ID: On Mon, 27 Mar 2023 13:35:46 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > whitespace Testing passed. ------------- Marked as reviewed by kvn (Reviewer). 
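As a concrete illustration of the multiply-and-shift rewrite described in the quoted text, the following stand-alone Java check uses the single divisor d = 3, whose magic constant is 0xAAAAAAAB = ceil(2^33 / 3). This is only a sketch of the idea for one divisor, not the C2 idealisation itself:

public class DivideUnsignedByConstSketch {
    // Unsigned 32-bit division by the constant 3, rewritten as a multiply and
    // a shift: q = floor(x * 0xAAAAAAAB / 2^33). The 64-bit product stays below
    // 2^64, so the logical shift recovers the exact unsigned quotient for every
    // unsigned 32-bit x.
    static int divideUnsignedBy3(int x) {
        long wide = Integer.toUnsignedLong(x) * 0xAAAAAAABL;
        return (int) (wide >>> 33);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 2, 3, 7, 100, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF};
        for (int x : samples) {
            int expected = Integer.divideUnsigned(x, 3);
            int actual = divideUnsignedBy3(x);
            System.out.println(Integer.toUnsignedString(x) + " / 3 = " + actual
                    + (actual == expected ? "" : "  <-- MISMATCH"));
        }
    }
}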
PR Review: https://git.openjdk.org/jdk/pull/9947#pullrequestreview-1363631429 From kvn at openjdk.org Wed Mar 29 17:42:31 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Mar 2023 17:42:31 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies [v2] In-Reply-To: References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: On Wed, 29 Mar 2023 10:30:52 GMT, Emanuel Peter wrote: >> I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). >> >> Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). >> >> This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. Example: >> >> https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 >> >> This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): >> >> >> 3.7 Scheduling >> Dependence analysis before packing ensures that statements within a group can be executed >> safely in parallel. However, it may be the case that executing two groups produces a dependence >> violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between >> groups if a statement in one group is dependent on a statement in the other. As long as there >> are no cycles in this dependence graph, all groups can be scheduled such that no violations >> occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group >> will need to be eliminated. Although experimental data has shown this case to be extremely rare, >> care must be taken to ensure correctness. >> >> >> **Solution** >> >> Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). >> >> **FYI** >> >> I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > review feedback implemented Good. Do you know if this affect any our existing vector tests? ------------- Marked as reviewed by kvn (Reviewer). 
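The scheduling check described in the quoted text (build a PacksetGraph from DepPreds and reject the packset if it cannot be topologically sorted) is ordinary cycle detection. Below is a minimal stand-alone Java sketch of that check with invented names and a hard-coded toy graph rather than the real C2 data structures:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PacksetCycleSketch {
    // Packs 0..n-1 with dependency edges "from must be scheduled before to".
    // If the topological sort cannot visit every pack, there is a cycle and
    // the chosen packset would be invalid.
    static boolean schedulable(int n, int[][] edges) {
        List<List<Integer>> succ = new ArrayList<>();
        int[] indegree = new int[n];
        for (int i = 0; i < n; i++) succ.add(new ArrayList<>());
        for (int[] e : edges) {
            succ.get(e[0]).add(e[1]);
            indegree[e[1]]++;
        }
        Deque<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < n; i++) if (indegree[i] == 0) ready.add(i);
        int scheduled = 0;
        while (!ready.isEmpty()) {
            int p = ready.poll();
            scheduled++;
            for (int s : succ.get(p)) {
                if (--indegree[s] == 0) ready.add(s);
            }
        }
        return scheduled == n; // false => at least one cycle among the packs
    }

    public static void main(String[] args) {
        // Two packs depending on each other, as in Figure 6 of the SLP paper:
        // the packset must be rejected (or at least one pack removed).
        System.out.println(schedulable(2, new int[][] {{0, 1}, {1, 0}})); // false
        System.out.println(schedulable(3, new int[][] {{0, 1}, {1, 2}})); // true
    }
}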
PR Review: https://git.openjdk.org/jdk/pull/13078#pullrequestreview-1363650618 From duke at openjdk.org Thu Mar 30 03:22:15 2023 From: duke at openjdk.org (changpeng1997) Date: Thu, 30 Mar 2023 03:22:15 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 17:04:58 GMT, Quan Anh Mai wrote: > We have `MacroLogicVNode` in the middle-end, can it be used to leverage the ternary logical instructions of SVE2 by selectively creating the node if the truth table is supported? Thanks a lot. Hi. I found that there were only two ternary logical instructions in SVE2 (BCAX, EOR3). Maybe we need not leverage them in the middle-end by using `MacroLogicVNode`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1489626744 From fgao at openjdk.org Thu Mar 30 04:20:06 2023 From: fgao at openjdk.org (Fei Gao) Date: Thu, 30 Mar 2023 04:20:06 GMT Subject: RFR: 8305055: IR check fails on some aarch64 platforms Message-ID: As @eme64 said in [1], [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) introduced some "collateral damage", disabling the vectorization of some conversions when `+AlignVector`. That affects IR checks of `TestVectorizeTypeConversion.java` and `ArrayTypeConvertTest.java` on some `aarch64` platforms like ThunderX and ThunderX2 [2]. This trivial patch is to allow IR check only when we have `-AlignVector`. [1] https://github.com/openjdk/jdk/pull/12350#issuecomment-1470065706 [2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154 ------------- Commit messages: - 8305055: IR check fails on some aarch64 platforms Changes: https://git.openjdk.org/jdk/pull/13236/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13236&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305055 Stats: 14 lines in 2 files changed: 12 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13236/head:pull/13236 PR: https://git.openjdk.org/jdk/pull/13236 From dzhang at openjdk.org Thu Mar 30 05:09:13 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 30 Mar 2023 05:09:13 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v11] In-Reply-To: References: Message-ID: <9Ncc9axOQ-CUhPcFCpPQ5i87LfVwuCqCMS3pq-rVpBY=.5a12bbef-ed95-42c1-ada4-16c02f7df281@github.com> > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? 
> > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[3]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. 
And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[2], so define v30 and v31 as mask register too. > > `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. > > By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [ ] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit: RISC-V: Support vector add mask instructions for Vector API ------------- Changes: https://git.openjdk.org/jdk/pull/12682/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=10 Stats: 724 lines in 6 files changed: 658 ins; 5 del; 61 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From xliu at openjdk.org Thu Mar 30 05:33:21 2023 From: xliu at openjdk.org (Xin Liu) Date: Thu, 30 Mar 2023 05:33:21 GMT Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal Message-ID: This patch improves how Region::Ideal trims unreachable paths. 1. Don't restart from beginning. Trimming doesn't change the DU-chain. 2. Replace DFIterator with DFIterator_Fast. The later is a raw pointer in release build. 3. Don't call add_users_to_worklist(this) repeatly. 4. Reduce its strength from add_users_to_worklist to add_users_to_worklist0 because RegionNode has no special logic. This patch also includes a cosmetic change: rename n to 'use' inside of the loop. Otherwise, we would overshadow Node* n = in(i). Nothing wrong but harder to read. 
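The first point above ("don't restart from beginning, trimming doesn't change the DU-chain") is the usual single-pass removal pattern. A loose Java analogy with invented names is shown below; it is not the RegionNode code, it only contrasts one backwards pass against rescanning after every removal:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TrimDeadInputsSketch {
    // Invented example: null entries stand for unreachable paths into a region.
    // Removing an element never invalidates the part of the list still to be
    // visited, so one backwards pass suffices and no restart is needed.
    static void trimDead(List<String> inputs) {
        // Slot 0 is skipped, mirroring that the inputs of interest start at 1.
        for (int i = inputs.size() - 1; i >= 1; i--) {
            if (inputs.get(i) == null) {
                inputs.remove(i);
            }
        }
    }

    public static void main(String[] args) {
        List<String> in = new ArrayList<>(Arrays.asList("self", null, "path1", null, "path2"));
        trimDead(in);
        System.out.println(in); // [self, path1, path2]
    }
}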
------------- Commit messages: - 8305203: Simplify trimming operation in Region::Ideal Changes: https://git.openjdk.org/jdk/pull/13238/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13238&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305203 Stats: 24 lines in 1 file changed: 4 ins; 9 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/13238.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13238/head:pull/13238 PR: https://git.openjdk.org/jdk/pull/13238 From epeter at openjdk.org Thu Mar 30 06:24:18 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 30 Mar 2023 06:24:18 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v4] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 17:00:14 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Restrict VerifyLoopOptimizations to ASSERT / DEBUG_ONLY > > src/hotspot/share/opto/loopnode.cpp line 4885: > >> 4883: // Irreducible loops can pick a different header (one of its entries). >> 4884: } else if (child_verify->_head->as_Region()->is_in_infinite_subgraph()) { >> 4885: // Infinite loops do not get attached to the loop-tree on their first visit. > > Can you explain why you don't check `child` for infinite subgraph? As the comment here says, infinite loops are not attached the first time we run `build_loop_tree`. But from the second time on, we have the `NeverBranch` added, so it has a "fake" exit, and is not really an infinite loop any more, so it will always be found and attached to the loop tree. `this` runs before `loop_verify`, so it is possible that `child` does not have the loop, but `child_verify` finds it. It is not possible the other way around. I can add a comment to the code for that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1152784489 From epeter at openjdk.org Thu Mar 30 06:30:20 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 30 Mar 2023 06:30:20 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v4] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 17:13:42 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Restrict VerifyLoopOptimizations to ASSERT / DEBUG_ONLY > > src/hotspot/share/opto/loopnode.cpp line 4862: > >> 4860: child_verify = children_verify.at(j); >> 4861: } >> 4862: if (child != nullptr && child_verify != nullptr && child->_head != child_verify->_head) { > > May be have sanity assert before this line that we can't have both values equal to `nullptr`. Will add it. > src/hotspot/share/opto/loopnode.cpp line 4863: > >> 4861: } >> 4862: if (child != nullptr && child_verify != nullptr && child->_head != child_verify->_head) { >> 4863: assert(child->_head->_idx != child_verify->_head->_idx, "is implied"); > > Why you need this assert? It duplicate the check you already have. Will remove it. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1152788962 PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1152788866 From epeter at openjdk.org Thu Mar 30 06:37:10 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 30 Mar 2023 06:37:10 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v4] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 06:21:47 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.cpp line 4885: >> >>> 4883: // Irreducible loops can pick a different header (one of its entries). >>> 4884: } else if (child_verify->_head->as_Region()->is_in_infinite_subgraph()) { >>> 4885: // Infinite loops do not get attached to the loop-tree on their first visit. >> >> Can you explain why you don't check `child` for infinite subgraph? > > As the comment here says, infinite loops are not attached the first time we run `build_loop_tree`. > But from the second time on, we have the `NeverBranch` added, so it has a "fake" exit, and is not really an infinite loop any more, so it will always be found and attached to the loop tree. > `this` runs before `loop_verify`, so it is possible that `child` does not have the loop, but `child_verify` finds it. It is not possible the other way around. > I can add a comment to the code for that. Side note: Before my patch, we did not check if `loop_verify` had any loops that we did not have in `this`. That is why we did not have to deal with suddenly appearing infinite loops. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1152791851 From epeter at openjdk.org Thu Mar 30 06:42:21 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 30 Mar 2023 06:42:21 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies [v2] In-Reply-To: References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: On Wed, 29 Mar 2023 17:39:43 GMT, Vladimir Kozlov wrote: > Do you know if this affect any our existing vector tests? @vladimir Thanks for the review. Yes. I had a run where I assert if I find cycles. I ran it up to tier5 and stress testing. And the assert was never triggered, except in the two regression tests that I added (there it triggered a lot). So I think it really has no effect, except the extra runtime. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13078#issuecomment-1489774620 From epeter at openjdk.org Thu Mar 30 06:37:05 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 30 Mar 2023 06:37:05 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v5] In-Reply-To: References: Message-ID: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. > > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. > `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. 
After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. > > **My Approach** > I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. > > **What I fixed** > > - `verify_compare` > - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). > - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. > - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable. > - I now report all failures, before asserting. > - `verify_tree` > - I corrected the style and improved comments. > - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. > - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`. > > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. > - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one. 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix after Vladimir's second round of suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13207/files - new: https://git.openjdk.org/jdk/pull/13207/files/fe56c534..f690bdbd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=03-04 Stats: 5 lines in 1 file changed: 4 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13207.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13207/head:pull/13207 PR: https://git.openjdk.org/jdk/pull/13207 From epeter at openjdk.org Thu Mar 30 06:53:15 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 30 Mar 2023 06:53:15 GMT Subject: RFR: 8305055: IR check fails on some aarch64 platforms In-Reply-To: References: Message-ID: <8mEOn_ZKzbPl7UjEXeSP5PtByUL5ia3i0Y5SGznjq-A=.584e547f-6a9b-4ed8-8130-328278849b32@github.com> On Thu, 30 Mar 2023 04:13:39 GMT, Fei Gao wrote: > As @eme64 said in [1], [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) introduced some "collateral damage", disabling the vectorization of some conversions when `+AlignVector`. That affects IR checks of `TestVectorizeTypeConversion.java` and `ArrayTypeConvertTest.java` on some `aarch64` platforms like ThunderX and ThunderX2 [2]. > > This trivial patch is to allow IR check only when we have `-AlignVector`. > > [1] https://github.com/openjdk/jdk/pull/12350#issuecomment-1470065706 > [2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154 Otherwise looks good. @fg1417 Thanks for doing this. I guess this is only a temporary solution? We should revert this change when we fix [JDK-8303827](https://bugs.openjdk.org/browse/JDK-8303827). I added a comment to that RFE. @fg1417 Maybe rather than mentioning [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) in the test, you could mention that we can remove it after [JDK-8303827](https://bugs.openjdk.org/browse/JDK-8303827). ------------- Marked as reviewed by epeter (Committer). PR Review: https://git.openjdk.org/jdk/pull/13236#pullrequestreview-1364434750 PR Comment: https://git.openjdk.org/jdk/pull/13236#issuecomment-1489781605 PR Comment: https://git.openjdk.org/jdk/pull/13236#issuecomment-1489783462 From tholenstein at openjdk.org Thu Mar 30 07:08:49 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 07:08:49 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v14] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. 
To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: Fix hitting assertion in filterChainChangedListener ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/0ce7ab64..e50eb974 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=12-13 Stats: 43 lines in 4 files changed: 9 ins; 20 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From thartmann at openjdk.org Thu Mar 30 06:59:16 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Mar 2023 06:59:16 GMT Subject: RFR: 8304445: Remaining uses of NULL in ciInstanceKlass.cpp In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 15:46:11 GMT, Ilya Korennoy wrote: > 8304445: Remaining uses of NULL in ciInstanceKlass.cpp Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13178#pullrequestreview-1364443111 From epeter at openjdk.org Thu Mar 30 07:26:54 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 30 Mar 2023 07:26:54 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v6] In-Reply-To: References: Message-ID: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. > > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. > `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. > > **My Approach** > I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. 
In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. > > **What I fixed** > > - `verify_compare` > - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). > - Previously, it was implemented as a BFS with recursion, which led to a stack overflow. I flattened the BFS into a loop (see the sketch below). > - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable. > - I now report all failures, before asserting. > - `verify_tree` > - I corrected the style and improved comments. > - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. > - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`. > > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. > - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verifications are still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one.
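The BFS flattening mentioned in the list above can be pictured with a small stand-alone sketch. This is not the actual HotSpot code: the `Node` struct, the `verify_node` callback and the container choices are hypothetical, and only illustrate how replacing recursion with an explicit worklist removes the native-stack-depth limit.

```
#include <queue>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an IR node; the real verification walks C2 nodes.
struct Node {
  std::vector<Node*> outputs;  // successor nodes to visit
};

// Breadth-first walk over all nodes reachable from 'root', driven by an explicit
// queue instead of recursion, so a deep graph cannot overflow the native stack.
// 'verify_node' stands for the per-node checks (e.g. idom/ctrl invariants).
template <typename VerifyFn>
void verify_reachable(Node* root, VerifyFn verify_node) {
  if (root == nullptr) {
    return;
  }
  std::queue<Node*> worklist;
  std::unordered_set<Node*> visited;
  worklist.push(root);
  visited.insert(root);
  while (!worklist.empty()) {
    Node* n = worklist.front();
    worklist.pop();
    verify_node(n);  // collect failures here rather than asserting immediately
    for (Node* out : n->outputs) {
      if (visited.insert(out).second) {  // true only the first time 'out' is seen
        worklist.push(out);
      }
    }
  }
}
```

The same worklist shape also makes it straightforward to report all failures before asserting, as described above.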
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13207/files - new: https://git.openjdk.org/jdk/pull/13207/files/f690bdbd..d01296d8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=04-05 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13207.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13207/head:pull/13207 PR: https://git.openjdk.org/jdk/pull/13207 From tholenstein at openjdk.org Thu Mar 30 07:19:48 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 07:19:48 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v13] In-Reply-To: <4ogpCo4lpZYsHzCjBH_c6mpRM0X8J4uaIqwaGvZ5RlA=.9d3dd44c-ccfb-4d31-913f-220980e01e2b@github.com> References: <4ogpCo4lpZYsHzCjBH_c6mpRM0X8J4uaIqwaGvZ5RlA=.9d3dd44c-ccfb-4d31-913f-220980e01e2b@github.com> Message-ID: On Wed, 29 Mar 2023 12:32:29 GMT, Tobias Hartmann wrote: > Works well for me but I spotted the following issue which seems to be a regression from this change: > > * Open two .xml files, open a graph in each and select the local profile > * Double click on a filter and then click Cancel > > ``` > java.lang.AssertionError > at com.sun.hotspot.igv.view.DiagramViewModel.filterChanged(DiagramViewModel.java:356) > at com.sun.hotspot.igv.view.DiagramViewModel.lambda$new$1(DiagramViewModel.java:73) > at com.sun.hotspot.igv.data.ChangedEvent.fire(ChangedEvent.java:44) > at com.sun.hotspot.igv.data.ChangedEvent.fire(ChangedEvent.java:31) > at com.sun.hotspot.igv.data.Event.fire(Event.java:56) > at com.sun.hotspot.igv.filter.FilterChain$1.changed(FilterChain.java:48) > at com.sun.hotspot.igv.filter.FilterChain$1.changed(FilterChain.java:45) > at com.sun.hotspot.igv.data.ChangedEvent.fire(ChangedEvent.java:44) > at com.sun.hotspot.igv.data.ChangedEvent.fire(ChangedEvent.java:31) > at com.sun.hotspot.igv.data.Event.fire(Event.java:56) > at com.sun.hotspot.igv.filter.CustomFilter.openInEditor(CustomFilter.java:86) > at com.sun.hotspot.igv.filter.CustomFilter$1.open(CustomFilter.java:77) > at org.openide.actions.OpenAction.performAction(OpenAction.java:59) > at org.openide.util.actions.NodeAction$DelegateAction$1.run(NodeAction.java:561) > at org.openide.util.actions.ActionInvoker$1.run(ActionInvoker.java:70) > at org.openide.util.actions.ActionInvoker.doPerformAction(ActionInvoker.java:91) > at org.openide.util.actions.ActionInvoker.invokeAction(ActionInvoker.java:74) > at org.openide.util.actions.NodeAction$DelegateAction.actionPerformed(NodeAction.java:558) > at org.openide.explorer.view.ListView.performObjectAt(ListView.java:681) > at org.openide.explorer.view.ListView$PopupSupport.mouseClicked(ListView.java:1306) > at java.desktop/java.awt.AWTEventMulticaster.mouseClicked(AWTEventMulticaster.java:278) > at java.desktop/java.awt.Component.processMouseEvent(Component.java:6638) > at java.desktop/javax.swing.JComponent.processMouseEvent(JComponent.java:3342) > at java.desktop/java.awt.Component.processEvent(Component.java:6400) > at java.desktop/java.awt.Container.processEvent(Container.java:2263) > at java.desktop/java.awt.Component.dispatchEventImpl(Component.java:5011) > at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2321) > at java.desktop/java.awt.Component.dispatchEvent(Component.java:4843) > at 
java.desktop/java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4918) > at java.desktop/java.awt.LightweightDispatcher.processMouseEvent(Container.java:4556) > at java.desktop/java.awt.LightweightDispatcher.dispatchEvent(Container.java:4488) > at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2307) > at java.desktop/java.awt.Window.dispatchEventImpl(Window.java:2772) > at java.desktop/java.awt.Component.dispatchEvent(Component.java:4843) > at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:772) > at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721) > at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715) > at java.base/java.security.AccessController.doPrivileged(Native Method) > at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) > at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:95) > at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:745) > at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:743) > at java.base/java.security.AccessController.doPrivileged(Native Method) > at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) > at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:742) > at org.netbeans.core.TimableEventQueue.dispatchEvent(TimableEventQueue.java:136) > [catch] at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203) > at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124) > at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113) > at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109) > at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101) > at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90) > ``` Thanks for catching that @TobiHartmann . It should be fixed now ------------- PR Comment: https://git.openjdk.org/jdk/pull/12714#issuecomment-1489812196 From thartmann at openjdk.org Thu Mar 30 07:10:18 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Mar 2023 07:10:18 GMT Subject: RFR: JDK-8304546: CompileTask::_directive leaked if CompileBroker::invoke_compiler_on_method not called In-Reply-To: References: Message-ID: On Mon, 20 Mar 2023 20:48:28 GMT, Justin King wrote: > Ensure `CompileTask::_directive` is not leaked when `CompileBroker::invoke_compiler_on_method` is not called. This can happen for stale tasks or when compilation is disabled. src/hotspot/share/compiler/compileTask.cpp line 71: > 69: if (task->_directive != nullptr) { > 70: DirectivesStack::release(task->_directive); > 71: task->clear_directive(); Just wondering, shouldn't we first call `task->clear_directive()` to avoid having a dangling pointer? It's of course not an issue here but a static code analysis tool might complain. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13108#discussion_r1152822535 From tholenstein at openjdk.org Thu Mar 30 07:19:48 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 07:19:48 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v15] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. 
Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the user can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filters. This filter profile can then be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the user closes IGV, the filters (in their exact order) are saved, as well as the filter profiles. The profile that was last used is selected when opening IGV. Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: center nodes after zooming ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/e50eb974..ba9aacc3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=13-14 Stats: 2 lines in 1 file changed: 1 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From epeter at openjdk.org Thu Mar 30 07:29:28 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 30 Mar 2023 07:29:28 GMT Subject: RFR: 8305222: Change unique_ctrl_out_or_null to unique_ctrl_out in PhaseCFG::convert_NeverBranch_to_Goto In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 21:08:45 GMT, Vladimir Kozlov wrote: >> Replaced `unique_ctrl_out_or_null` with `unique_ctrl_out`, which asserts if it finds `nullptr`. This is better than running into a `nullptr`-dereference inside `get_block_for_node`. >> >> This was found by a static code analyzer, so it is not clear that a `nullptr` dereference would ever happen. But let's still fix it. > > Good. @vnkozlov @TobiHartmann Thanks for the reviews!
------------- PR Comment: https://git.openjdk.org/jdk/pull/12919#issuecomment-1489823321 From epeter at openjdk.org Thu Mar 30 07:29:30 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 30 Mar 2023 07:29:30 GMT Subject: Integrated: 8305222: Change unique_ctrl_out_or_null to unique_ctrl_out in PhaseCFG::convert_NeverBranch_to_Goto In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 08:51:46 GMT, Emanuel Peter wrote: > Replaced `unique_ctrl_out_or_null` with `unique_ctrl_out`, which asserts if it finds `nullptr`. This is better than running into a `nullptr`-dereference inside `get_block_for_node`. > > This was found by a static code analyzer, so it is not clear that a `nullptr` dereference would ever happen. But let's still fix it. This pull request has now been integrated. Changeset: 77811fa3 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/77811fa39be4ed7b50beb911c30f685377372655 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8305222: Change unique_ctrl_out_or_null to unique_ctrl_out in PhaseCFG::convert_NeverBranch_to_Goto Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12919 From thartmann at openjdk.org Thu Mar 30 07:43:24 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Mar 2023 07:43:24 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v15] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 07:19:48 GMT, Tobias Holenstein wrote: >> In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). >> >> - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile >> - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. >> >> ### Global profile >> Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. >> >> ### Local profile >> Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. >> tabA >> >> When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. >> tabB >> >> ### New profile >> The user can also create a new filter profile and give it a name. E.g. `My Filters` >> newProfile >> >> The `My Filters` profile is then globally available to other tabs as well >> selectProfile >> >> >> ### Filters for cloned tabs >> When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened >> cloneTab >> >> ### Saving of filters and profiles >> When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > center nodes after zooming Works well now. Thanks for fixing! ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/12714#pullrequestreview-1364508108 From thartmann at openjdk.org Thu Mar 30 07:45:15 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Mar 2023 07:45:15 GMT Subject: RFR: 8305055: IR check fails on some aarch64 platforms In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 04:13:39 GMT, Fei Gao wrote: > As @eme64 said in [1], [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) introduced some "collateral damage", disabling the vectorization of some conversions when `+AlignVector`. That affects IR checks of `TestVectorizeTypeConversion.java` and `ArrayTypeConvertTest.java` on some `aarch64` platforms like ThunderX and ThunderX2 [2]. > > This trivial patch is to allow IR check only when we have `-AlignVector`. > > [1] https://github.com/openjdk/jdk/pull/12350#issuecomment-1470065706 > [2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154 +1 to Emanuel's suggestion. Looks good to me otherwise. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13236#pullrequestreview-1364511006 From tholenstein at openjdk.org Thu Mar 30 07:55:21 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 07:55:21 GMT Subject: RFR: JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name Message-ID: Graphs in IGV that were osr compiled have an "osr" property. To make it easier to distinguish them from non-osr, append `[OSR]` to the name of the group in the outline osr ------------- Commit messages: - JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name Changes: https://git.openjdk.org/jdk/pull/13241/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13241&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305223 Stats: 5 lines in 1 file changed: 4 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13241.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13241/head:pull/13241 PR: https://git.openjdk.org/jdk/pull/13241 From eliu at openjdk.org Thu Mar 30 08:14:18 2023 From: eliu at openjdk.org (Eric Liu) Date: Thu, 30 Mar 2023 08:14:18 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB [v2] In-Reply-To: References: Message-ID: On Mon, 20 Mar 2023 10:24:10 GMT, Tobias Hartmann wrote: >> Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge jdk:master >> >> Change-Id: I40cce803da09bae31cd74b86bf93607a08219545 >> - 8303278: Imprecise bottom type of ExtractB/UB >> >> This is a trivial patch, which fixes the bottom type of ExtractB/UB >> nodes. >> >> ExtractNode can be generated by Vector API Vector.lane(int), which gets >> the lane element at the given index. A more precise type of range can >> help to optimize out unnecessary type conversion in some cases. 
>> >> Below shows a typical case used ExtractBNode >> >> ``` >> public static byte byteLt16() { >> ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); >> return vecb.lane(1); >> } >> >> ``` >> In this case, c2 constructs IR graph like: >> >> ExtractB ConI(24) >> | __| >> | / | >> LShiftI __| >> | / >> RShiftI >> >> which generates AArch64 code: >> >> movi v16.16b, #0x1 >> smov x11, v16.b[1] >> sxtb w0, w11 >> >> with this patch, this shift pair can be optimized out by RShiftI's >> identity [1]. The code is optimized to: >> >> movi v16.16b, #0x1 >> smov x0, v16.b[1] >> >> [TEST] >> >> Full jtreg passed except 4 files on x86: >> >> jdk/incubator/vector/Byte128VectorTests.java >> jdk/incubator/vector/Byte256VectorTests.java >> jdk/incubator/vector/Byte512VectorTests.java >> jdk/incubator/vector/Byte64VectorTests.java >> >> They are caused by a known issue on x86 [2]. >> >> [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 >> [2] https://bugs.openjdk.org/browse/JDK-8303508 >> >> Change-Id: Ibea9aeacb41b4d1c5b2621c7a97494429394b599 > > This triggers failures in testing: > > > jdk/incubator/vector/Byte64VectorTests.java > > java.lang.Exception: failures: 1 > at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:95) > at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:53) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) > at java.base/java.lang.reflect.Method.invoke(Method.java:578) > at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:125) > at java.base/java.lang.Thread.run(Thread.java:1623) @TobiHartmann Could you help to take a look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13070#issuecomment-1489880736 From tholenstein at openjdk.org Thu Mar 30 08:16:25 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 08:16:25 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v16] In-Reply-To: References: Message-ID: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. 
`My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: remove print ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12714/files - new: https://git.openjdk.org/jdk/pull/12714/files/ba9aacc3..754a24a8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=15 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12714&range=14-15 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12714.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12714/head:pull/12714 PR: https://git.openjdk.org/jdk/pull/12714 From chagedorn at openjdk.org Thu Mar 30 08:16:29 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 30 Mar 2023 08:16:29 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v15] In-Reply-To: References: Message-ID: <4OHiE9JS1wm2-jVYlOdU12e-CeMJFSkKJkQqxqKO5Oo=.92cf2bec-0f8f-4618-a895-a0d5548a4e11@github.com> On Thu, 30 Mar 2023 07:19:48 GMT, Tobias Holenstein wrote: >> In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). >> >> - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile >> - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. >> >> ### Global profile >> Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. >> >> ### Local profile >> Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. >> tabA >> >> When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. >> tabB >> >> ### New profile >> The user can also create a new filter profile and give it a name. E.g. `My Filters` >> newProfile >> >> The `My Filters` profile is then globally available to other tabs as well >> selectProfile >> >> >> ### Filters for cloned tabs >> When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened >> cloneTab >> >> ### Saving of filters and profiles >> When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > center nodes after zooming Thanks for making the changes! The update looks good. 
src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 358: > 356: > 357: void close() { > 358: System.out.println(getGraph().getDisplayName() + " removeListener to " + filterChain.getName()); Leftover from debugging? ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/12714#pullrequestreview-1364554424 PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1152886813 From tholenstein at openjdk.org Thu Mar 30 08:16:30 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 08:16:30 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally In-Reply-To: References: Message-ID: <2FztUbxDxKq-N_eCSHSQx3s09Pn6ey0Tuhoq2Eivxg0=.a32ccffc-5a5e-4cc5-925e-64dbd737d7ce@github.com> On Tue, 7 Mar 2023 12:37:06 GMT, Roberto Castañeda Lozano wrote: >> I updated the PR. @robcasloz >> >>> However, as an IGV user I miss two things from the current behavior: persistence (the same filters are applied after restarting IGV) >> >> I agree with this. Now global filter profiles are saved and reloaded at startup >> >>> and the ability to apply the same filter configuration to all tabs in a simple manner. >> >> Before my PR all filter profiles were global. And they still are except for the `--Local--` profile. I now added also a `--Global--` profile that is selected by default. >> >>> I would like to propose an alternative model that is almost a superset of what is proposed here and would preserve persistence and easy filter synchronization among tabs. By default, each tab has two filter profiles available, “local” and “global”. >> >> I added that now. >> >>> More profiles cannot be added or removed. >> >> I would prefer to keep the option to define new profiles (especially, now that they are saved and reloaded at startup) >> >>> The local filter profile can be edited but is not persistent (i.e. it acts like the --Custom-- profile in this changeset). >> >> That's what we have now >> >>> The global filter profile can be edited, is persistent, and the changes are propagated for all tabs where it is selected. >> >> `--Global--` is like this >> >>> The Link node selection globally button is generalized to Link node and filter selection globally. It is disabled by default, and clicking on it selects the global filter profile for all opened tabs. >> >> I prefer to keep the option to have a Tab with local and a Tab with global filters AND be able to link the selection. > >> I now added also a --Global-- profile that is selected by default. > > Thanks for the changes, Toby. I can see the `--Global--` profile selected by default, however as soon as I open a graph it switches to `--Local--`. Is this intended? Thanks @robcasloz, @chhagedorn and @TobiHartmann for the reviews!
------------- PR Comment: https://git.openjdk.org/jdk/pull/12714#issuecomment-1489879259 From tholenstein at openjdk.org Thu Mar 30 08:16:33 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 08:16:33 GMT Subject: RFR: JDK-8302644: IGV: Apply filters per graph tab and not globally [v15] In-Reply-To: <4OHiE9JS1wm2-jVYlOdU12e-CeMJFSkKJkQqxqKO5Oo=.92cf2bec-0f8f-4618-a895-a0d5548a4e11@github.com> References: <4OHiE9JS1wm2-jVYlOdU12e-CeMJFSkKJkQqxqKO5Oo=.92cf2bec-0f8f-4618-a895-a0d5548a4e11@github.com> Message-ID: <9jIPDa4q-E1tXFisogYbw4l0CBbOXTSLaNHycy8621o=.a888befd-7566-410a-9563-c741de0acb1c@github.com> On Thu, 30 Mar 2023 08:04:12 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> center nodes after zooming > > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 358: > >> 356: >> 357: void close() { >> 358: System.out.println(getGraph().getDisplayName() + " removeListener to " + filterChain.getName()); > > Leftover from debugging? right. I removed it ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12714#discussion_r1152892862 From tholenstein at openjdk.org Thu Mar 30 08:16:35 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 08:16:35 GMT Subject: Integrated: JDK-8302644: IGV: Apply filters per graph tab and not globally In-Reply-To: References: Message-ID: <4yrnxNdthPl2lzbNJAyxVDtr1F_1vB-R3YBLvIHV2Kg=.167f2b85-5223-4812-9f79-7d3e4ce05a4d@github.com> On Wed, 22 Feb 2023 13:59:41 GMT, Tobias Holenstein wrote: > In IGV the user can apply a set of filters to a graph. Currently, the same set of selected filters is applied to all graphs (globally). > > - With this change the use can define a set of filters for each individual graph tab using the `--Local--` profile > - Further a filter profile can be created that represents a set of filter. This filter profile can the be selected in each graph tab individually. > > ### Global profile > Each tab has a `--Global--` filter profile which is selected when opening a graph. Filters applied to the `--Global--` profile are applied to all tabs that have the `--Global--` profile selected. > > ### Local profile > Each tab has its own `--Local--` filter profile. Filters applied to the `--Local--` profile are applied only to the currently selected tabs. Only one tab can be selected at a time and a tab gets selected by clicking on it. To make it more clear which tab is currently selected, the title of the selected tab is displayed in **bold** font. > tabA > > When clicking on a different tab with a different `--Local--` profile, the selected filters get updated accordingly. > tabB > > ### New profile > The user can also create a new filter profile and give it a name. E.g. `My Filters` > newProfile > > The `My Filters` profile is then globally available to other tabs as well > selectProfile > > > ### Filters for cloned tabs > When the user clones a tab, the `--Local--` profile gets cloned as well. Further the clone has the same filter profile selected when it gets opened > cloneTab > > ### Saving of filters and profiles > When the users closes IGV, the filters (in their exact order) are save, as well as the filter profiles. The profile that was last used is selected when opening IGV. This pull request has now been integrated. 
Changeset: 2c38e67b Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/2c38e67b296c7133dae36d5dbd0064c602b85d4f Stats: 897 lines in 17 files changed: 353 ins; 369 del; 175 mod 8302644: IGV: Apply filters per graph tab and not globally Reviewed-by: rcastanedalo, chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12714 From qamai at openjdk.org Thu Mar 30 08:30:15 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 30 Mar 2023 08:30:15 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: <9apf9GvmnVE-NucP7WFPMWqhc5RD2tE5cDhHt2t2nh0=.ef40d3ed-847f-4c28-9c3d-36f5bbf41588@github.com> On Thu, 30 Mar 2023 03:19:17 GMT, changpeng1997 wrote: >> We have `MacroLogicVNode` in the middle-end, can it be used to leverage the ternary logical instructions of SVE2 by selectively creating the node if the truth table is supported? Thanks a lot. > >> We have `MacroLogicVNode` in the middle-end, can it be used to leverage the ternary logical instructions of SVE2 by selectively creating the node if the truth table is supported? Thanks a lot. > > Hi. I found that there were only two ternary logical instructions in SVE2 (BCAX, EOR3). > Maybe we need not to leverage them in the middle-end by using `MacroLogicVNode`. @changpeng1997 There are also the bit-select instructions (`BSL`, `BSL1N`, `BSL2N`, `NBSL`). ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1489901998 From duke at openjdk.org Thu Mar 30 08:33:15 2023 From: duke at openjdk.org (changpeng1997) Date: Thu, 30 Mar 2023 08:33:15 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 03:19:17 GMT, changpeng1997 wrote: >> We have `MacroLogicVNode` in the middle-end, can it be used to leverage the ternary logical instructions of SVE2 by selectively creating the node if the truth table is supported? Thanks a lot. > >> We have `MacroLogicVNode` in the middle-end, can it be used to leverage the ternary logical instructions of SVE2 by selectively creating the node if the truth table is supported? Thanks a lot. > > Hi. I found that there were only two ternary logical instructions in SVE2 (BCAX, EOR3). > Maybe we need not to leverage them in the middle-end by using `MacroLogicVNode`. > @changpeng1997 There are also the bit-select instructions (`BSL`, `BSL1N`, `BSL2N`, `NBSL`). Yes, but they are not logic-operation instructions. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1489905666 From duke at openjdk.org Thu Mar 30 08:36:39 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Thu, 30 Mar 2023 08:36:39 GMT Subject: RFR: 8304445: Remaining uses of NULL in ciInstanceKlass.cpp In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 06:56:48 GMT, Tobias Hartmann wrote: >> 8304445: Remaining uses of NULL in ciInstanceKlass.cpp > > Looks good to me. @TobiHartmann thank you for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13178#issuecomment-1489910170 From fgao at openjdk.org Thu Mar 30 08:44:06 2023 From: fgao at openjdk.org (Fei Gao) Date: Thu, 30 Mar 2023 08:44:06 GMT Subject: RFR: 8305055: IR check fails on some aarch64 platforms [v2] In-Reply-To: References: Message-ID: > As @eme64 said in [1], [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) introduced some "collateral damage", disabling the vectorization of some conversions when `+AlignVector`. 
That affects IR checks of `TestVectorizeTypeConversion.java` and `ArrayTypeConvertTest.java` on some `aarch64` platforms like ThunderX and ThunderX2 [2]. > > This trivial patch is to allow IR check only when we have `-AlignVector`. > > [1] https://github.com/openjdk/jdk/pull/12350#issuecomment-1470065706 > [2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154 Fei Gao has updated the pull request incrementally with one additional commit since the last revision: Update comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13236/files - new: https://git.openjdk.org/jdk/pull/13236/files/78fcd6e5..04f8b116 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13236&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13236&range=00-01 Stats: 8 lines in 2 files changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/13236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13236/head:pull/13236 PR: https://git.openjdk.org/jdk/pull/13236 From fgao at openjdk.org Thu Mar 30 08:44:08 2023 From: fgao at openjdk.org (Fei Gao) Date: Thu, 30 Mar 2023 08:44:08 GMT Subject: RFR: 8305055: IR check fails on some aarch64 platforms In-Reply-To: <8mEOn_ZKzbPl7UjEXeSP5PtByUL5ia3i0Y5SGznjq-A=.584e547f-6a9b-4ed8-8130-328278849b32@github.com> References: <8mEOn_ZKzbPl7UjEXeSP5PtByUL5ia3i0Y5SGznjq-A=.584e547f-6a9b-4ed8-8130-328278849b32@github.com> Message-ID: On Thu, 30 Mar 2023 06:48:31 GMT, Emanuel Peter wrote: >> As @eme64 said in [1], [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) introduced some "collateral damage", disabling the vectorization of some conversions when `+AlignVector`. That affects IR checks of `TestVectorizeTypeConversion.java` and `ArrayTypeConvertTest.java` on some `aarch64` platforms like ThunderX and ThunderX2 [2]. >> >> This trivial patch is to allow IR check only when we have `-AlignVector`. >> >> [1] https://github.com/openjdk/jdk/pull/12350#issuecomment-1470065706 >> [2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154 > > @fg1417 Maybe rather than mentioning [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) in the test, you could mention that we can remove it after [JDK-8303827](https://bugs.openjdk.org/browse/JDK-8303827). Thanks for your kind review and suggestion, @eme64 @TobiHartmann . Updated the comments in the new commit. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13236#issuecomment-1489918184 From duke at openjdk.org Thu Mar 30 08:51:30 2023 From: duke at openjdk.org (changpeng1997) Date: Thu, 30 Mar 2023 08:51:30 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 11:06:38 GMT, Andrew Haley wrote: > > This computing pattern (a ^ (b & (~c))) can be found in some SHA-3 java implementation, like https://github.com/aelstad/keccakj/blob/07185d29fb6c881570e2d7fd2b160460626dc130/src/main/java/com/github/aelstad/keccakj/core/Keccak1600.java#L309. > > I believe this patch can accelerate some SHA-3 applications implemented by Java. > > OK, thanks. Please benchmark this and let us know the result. It'd also be interesting to know how it compares with our SHA-3 intrinsic. @theRealAph I have run the SHA-3 benchmark [1] [2] and I found BCAX instruction cannot be generated since SLP failed to vectorize these code. I tend to agree with you. 
In addition, SHA-3 intrinsic has been removed so we cannot compare with it. [1]: https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java [2]: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/sun/security/provider/SHA3.java ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1489930583 From thartmann at openjdk.org Thu Mar 30 08:57:19 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Mar 2023 08:57:19 GMT Subject: RFR: 8305055: IR check fails on some aarch64 platforms [v2] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 08:44:06 GMT, Fei Gao wrote: >> As @eme64 said in [1], [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) introduced some "collateral damage", disabling the vectorization of some conversions when `+AlignVector`. That affects IR checks of `TestVectorizeTypeConversion.java` and `ArrayTypeConvertTest.java` on some `aarch64` platforms like ThunderX and ThunderX2 [2]. >> >> This trivial patch is to allow IR check only when we have `-AlignVector`. >> >> [1] https://github.com/openjdk/jdk/pull/12350#issuecomment-1470065706 >> [2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154 > > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Update comments Marked as reviewed by thartmann (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13236#pullrequestreview-1364644787 From thartmann at openjdk.org Thu Mar 30 09:00:21 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Mar 2023 09:00:21 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB [v2] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 01:22:10 GMT, Eric Liu wrote: >> This is a trivial patch, which fixes the bottom type of ExtractB/UB nodes. >> >> ExtractNode can be generated by Vector API Vector.lane(int), which gets the lane element at the given index. A more precise type of range can help to optimize out unnecessary type conversion in some cases. >> >> Below shows a typical case used ExtractBNode >> >> >> public static byte byteLt16() { >> ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); >> return vecb.lane(1); >> } >> >> >> In this case, c2 constructs IR graph like: >> >> ExtractB ConI(24) >> | __| >> | / | >> LShiftI __| >> | / >> RShiftI >> >> which generates AArch64 code: >> >> movi v16.16b, #0x1 >> smov x11, v16.b[1] >> sxtb w0, w11 >> >> with this patch, this shift pair can be optimized out by RShiftI's identity [1]. The code is optimized to: >> >> movi v16.16b, #0x1 >> smov x0, v16.b[1] >> >> [TEST] >> >> Full jtreg passed except 4 files on x86: >> >> jdk/incubator/vector/Byte128VectorTests.java >> jdk/incubator/vector/Byte256VectorTests.java >> jdk/incubator/vector/Byte512VectorTests.java >> jdk/incubator/vector/Byte64VectorTests.java >> >> They are caused by a known issue on x86 [2]. >> >> [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 >> [2] https://bugs.openjdk.org/browse/JDK-8303508 > > Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: > > - Merge jdk:master > > Change-Id: I40cce803da09bae31cd74b86bf93607a08219545 > - 8303278: Imprecise bottom type of ExtractB/UB > > This is a trivial patch, which fixes the bottom type of ExtractB/UB > nodes. > > ExtractNode can be generated by Vector API Vector.lane(int), which gets > the lane element at the given index. A more precise type of range can > help to optimize out unnecessary type conversion in some cases. > > Below shows a typical case used ExtractBNode > > ``` > public static byte byteLt16() { > ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); > return vecb.lane(1); > } > > ``` > In this case, c2 constructs IR graph like: > > ExtractB ConI(24) > | __| > | / | > LShiftI __| > | / > RShiftI > > which generates AArch64 code: > > movi v16.16b, #0x1 > smov x11, v16.b[1] > sxtb w0, w11 > > with this patch, this shift pair can be optimized out by RShiftI's > identity [1]. The code is optimized to: > > movi v16.16b, #0x1 > smov x0, v16.b[1] > > [TEST] > > Full jtreg passed except 4 files on x86: > > jdk/incubator/vector/Byte128VectorTests.java > jdk/incubator/vector/Byte256VectorTests.java > jdk/incubator/vector/Byte512VectorTests.java > jdk/incubator/vector/Byte64VectorTests.java > > They are caused by a known issue on x86 [2]. > > [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 > [2] https://bugs.openjdk.org/browse/JDK-8303508 > > Change-Id: Ibea9aeacb41b4d1c5b2621c7a97494429394b599 Sure, I'll re-run testing. ------------- PR Review: https://git.openjdk.org/jdk/pull/13070#pullrequestreview-1364649519 From duke at openjdk.org Thu Mar 30 09:00:40 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Thu, 30 Mar 2023 09:00:40 GMT Subject: Integrated: 8304445: Remaining uses of NULL in ciInstanceKlass.cpp In-Reply-To: References: Message-ID: <6JI_HgdqIX05d9wUKezEWalg-EurNC3uoknqFg4yqbg=.f9f864ce-fdbc-4f3e-9d4a-3ef85aa8a715@github.com> On Fri, 24 Mar 2023 15:46:11 GMT, Ilya Korennoy wrote: > 8304445: Remaining uses of NULL in ciInstanceKlass.cpp This pull request has now been integrated. Changeset: b261e6c4 Author: Ilya Korennoy Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/b261e6c43f8ef219d309683cc8ff92ecedc9126a Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8304445: Remaining uses of NULL in ciInstanceKlass.cpp Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13178 From thartmann at openjdk.org Thu Mar 30 09:01:27 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Mar 2023 09:01:27 GMT Subject: RFR: JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name In-Reply-To: References: Message-ID: <34xb3nGAybhQeR36tczM4JaevFVY8SrK_Ab7tXJGgJQ=.ef00e088-7e2a-4527-b031-938141fbaef7@github.com> On Thu, 30 Mar 2023 07:46:43 GMT, Tobias Holenstein wrote: > Graphs in IGV that were osr compiled have an "osr" property. To make it easier to distinguish them from non-osr, append `[OSR]` to the name of the group in the outline > > osr Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13241#pullrequestreview-1364652245 From qamai at openjdk.org Thu Mar 30 09:02:19 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 30 Mar 2023 09:02:19 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:12:59 GMT, changpeng1997 wrote: > We can use BCAX [1] [2] to merge a bit clear and an exclusive-OR operation. For example, on a 128-bit aarch64 machine which supports NEON and SHA3, following instruction sequence: > > > ... > bic v16.16b, v16.16b, v17.16b > eor v16.16b, v16.16b, v18.16b > ... > > > can be optimized to: > > > ... > bcax v16.16b, v17.16b, v16.16b, v18.16b > ... > > > This patch adds backend rules for BCAX, and we can gain almost 10% performance lift on a 128-bit aarch64 machine which supports NEON and SHA3. Similar performance uplift can also be observed on SVE2. > > Performance_Before: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 9779.361 47.184 > TestInt#size(2048) 3028.617 7.292 > TestLong#size(2048) 1331.216 1.815 > TestShort#size(2048) 5828.089 8.975 > > > Performance_BCAX_NEON: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 10510.371 34.931 > TestInt#size(2048) 3437.512 81.318 > TestLong#size(2048) 1461.023 0.679 > TestShort#size(2048) 6238.210 26.452 > > > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/BCAX--Bit-Clear-and-XOR- > [2]: https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/BCAX--Bitwise-clear-and-exclusive-OR-?lang=en `bsl(dst, src2, src3) == (dst & src3) | (src2 & ~src3)` which is a bitwise logical operation ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1489945729 From duke at openjdk.org Thu Mar 30 09:10:14 2023 From: duke at openjdk.org (changpeng1997) Date: Thu, 30 Mar 2023 09:10:14 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 08:59:33 GMT, Quan Anh Mai wrote: > `bsl(dst, src2, src3) == (dst & src3) | (src2 & ~src3)` which is a bitwise logical operation @merykitty Sorry, these instructions are logical-operation instructions indeed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1489957022 From dzhang at openjdk.org Thu Mar 30 09:16:06 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 30 Mar 2023 09:16:06 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? 
> > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. 
And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. As opposed to this, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. > > By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [ ] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Fix comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/c66fefec..8083ede3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=10-11 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From aph at openjdk.org Thu Mar 30 09:32:18 2023 From: aph at openjdk.org (Andrew Haley) Date: Thu, 30 Mar 2023 09:32:18 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 08:48:39 GMT, changpeng1997 wrote: > In addition, SHA-3 intrinsic has been removed so we cannot compare with it. When was it removed? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1489985286 From duke at openjdk.org Thu Mar 30 09:32:19 2023 From: duke at openjdk.org (changpeng1997) Date: Thu, 30 Mar 2023 09:32:19 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 09:27:46 GMT, Andrew Haley wrote: > > In addition, SHA-3 intrinsic has been removed so we cannot compare with it. > > When was it removed? Sorry, it was disabled. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1489987397 From rcastanedalo at openjdk.org Thu Mar 30 09:41:22 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 30 Mar 2023 09:41:22 GMT Subject: RFR: JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 07:46:43 GMT, Tobias Holenstein wrote: > Graphs in IGV that were osr compiled have an "osr" property. To make it easier to distinguish them from non-osr, append `[OSR]` to the name of the group in the outline > > osr Looks good. Could you update the copyright year? ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13241#pullrequestreview-1364724568 From aph at openjdk.org Thu Mar 30 10:05:15 2023 From: aph at openjdk.org (Andrew Haley) Date: Thu, 30 Mar 2023 10:05:15 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:12:59 GMT, changpeng1997 wrote: > We can use BCAX [1] [2] to merge a bit clear and an exclusive-OR operation. For example, on a 128-bit aarch64 machine which supports NEON and SHA3, following instruction sequence: > > > ... > bic v16.16b, v16.16b, v17.16b > eor v16.16b, v16.16b, v18.16b > ... > > > can be optimized to: > > > ... > bcax v16.16b, v17.16b, v16.16b, v18.16b > ... > > > This patch adds backend rules for BCAX, and we can gain almost 10% performance lift on a 128-bit aarch64 machine which supports NEON and SHA3. Similar performance uplift can also be observed on SVE2. > > Performance_Before: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 9779.361 47.184 > TestInt#size(2048) 3028.617 7.292 > TestLong#size(2048) 1331.216 1.815 > TestShort#size(2048) 5828.089 8.975 > > > Performance_BCAX_NEON: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 10510.371 34.931 > TestInt#size(2048) 3437.512 81.318 > TestLong#size(2048) 1461.023 0.679 > TestShort#size(2048) 6238.210 26.452 > > > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/BCAX--Bit-Clear-and-XOR- > [2]: https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/BCAX--Bitwise-clear-and-exclusive-OR-?lang=en > > > In addition, SHA-3 intrinsic has been removed so we cannot compare with it. > > > > > > When was it removed? > > Sorry, it was disabled on AArch64 except Apple. #11382 OK, so it seems that adding a BCAX rule is not necessary for the AArch64 port of HotSpot. In summary: BCAX was added to the instruction set for SHA-3, but it doesn't help much even in a hand-coded intrinsic. There is no significant performance advantage to be gained from this patch, and it adds a burden. I suggest that you should withdraw it. Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1490031743 From duke at openjdk.org Thu Mar 30 10:08:20 2023 From: duke at openjdk.org (changpeng1997) Date: Thu, 30 Mar 2023 10:08:20 GMT Subject: RFR: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: <4DcY31iZ6umxm6X8CopwObYLi2aBFbtWrCZlMTjL8Xw=.b0931ee4-a1f3-46f3-b956-76815cc8cddc@github.com> On Thu, 30 Mar 2023 10:02:24 GMT, Andrew Haley wrote: > > > > In addition, SHA-3 intrinsic has been removed so we cannot compare with it. > > > > > > > > > When was it removed? > > > > > > Sorry, it was disabled on AArch64 except Apple. 
#11382 > > OK, so it seems that adding a BCAX rule is not necessary for the AArch64 port of HotSpot. > > In summary: BCAX was added to the instruction set for SHA-3, but it doesn't help much even in a hand-coded intrinsic. There is no significant performance advantage to be gained from this patch, and it adds a burden. I suggest that you should withdraw it. Thanks. Ok. Thanks for your review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13222#issuecomment-1490035414 From duke at openjdk.org Thu Mar 30 10:11:32 2023 From: duke at openjdk.org (changpeng1997) Date: Thu, 30 Mar 2023 10:11:32 GMT Subject: Withdrawn: 8303553: AArch64: Add BCAX backend rule In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 09:12:59 GMT, changpeng1997 wrote: > We can use BCAX [1] [2] to merge a bit clear and an exclusive-OR operation. For example, on a 128-bit aarch64 machine which supports NEON and SHA3, following instruction sequence: > > > ... > bic v16.16b, v16.16b, v17.16b > eor v16.16b, v16.16b, v18.16b > ... > > > can be optimized to: > > > ... > bcax v16.16b, v17.16b, v16.16b, v18.16b > ... > > > This patch adds backend rules for BCAX, and we can gain almost 10% performance lift on a 128-bit aarch64 machine which supports NEON and SHA3. Similar performance uplift can also be observed on SVE2. > > Performance_Before: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 9779.361 47.184 > TestInt#size(2048) 3028.617 7.292 > TestLong#size(2048) 1331.216 1.815 > TestShort#size(2048) 5828.089 8.975 > > > Performance_BCAX_NEON: > > > Benchmark Score(op/ms) Error > TestByte#size(2048) 10510.371 34.931 > TestInt#size(2048) 3437.512 81.318 > TestLong#size(2048) 1461.023 0.679 > TestShort#size(2048) 6238.210 26.452 > > > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/BCAX--Bit-Clear-and-XOR- > [2]: https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/BCAX--Bitwise-clear-and-exclusive-OR-?lang=en This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/13222 From tholenstein at openjdk.org Thu Mar 30 10:24:06 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 10:24:06 GMT Subject: RFR: JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name [v2] In-Reply-To: References: Message-ID: > Graphs in IGV that were osr compiled have an "osr" property. To make it easier to distinguish them from non-osr, append `[OSR]` to the name of the group in the outline > > osr Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: Update Group.java update copyright year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13241/files - new: https://git.openjdk.org/jdk/pull/13241/files/bfe14342..1c108367 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13241&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13241&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13241.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13241/head:pull/13241 PR: https://git.openjdk.org/jdk/pull/13241 From tholenstein at openjdk.org Thu Mar 30 10:24:09 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 30 Mar 2023 10:24:09 GMT Subject: RFR: JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name [v2] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 09:38:38 GMT, Roberto Casta?eda Lozano wrote: > Looks good. 
Could you update the copyright year? done ------------- PR Comment: https://git.openjdk.org/jdk/pull/13241#issuecomment-1490054214 From rcastanedalo at openjdk.org Thu Mar 30 10:52:17 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 30 Mar 2023 10:52:17 GMT Subject: RFR: JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name [v2] In-Reply-To: References: Message-ID: <63V1NQ9q1ojZnqyd4Jsq5RGY08eh6c_E59Fve9h2j98=.b00771af-4f72-448f-8a04-cad10c9defa8@github.com> On Thu, 30 Mar 2023 10:24:06 GMT, Tobias Holenstein wrote: >> Graphs in IGV that were osr compiled have an "osr" property. To make it easier to distinguish them from non-osr, append `[OSR]` to the name of the group in the outline >> >> osr > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > Update Group.java > > update copyright year Marked as reviewed by rcastanedalo (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13241#pullrequestreview-1364841920 From duke at openjdk.org Thu Mar 30 12:56:10 2023 From: duke at openjdk.org (SUN Guoyun) Date: Thu, 30 Mar 2023 12:56:10 GMT Subject: RFR: 8305236: The LoadLoad barrier has been useless after JDK-8220051 Message-ID: After JDK-8220051, Interpreter::notice_safepoints() is only executed at a safepoint, so the LoadLoad barrier is useless. Barrier directives are generally time-consuming, so this patch removes the LoadLoad barriers used for aarch64 and riscv. Please help review it. Thanks. ------------- Commit messages: - 8305236: The LoadLoad barrier has been useless after JDK-8220051 Changes: https://git.openjdk.org/jdk/pull/13244/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13244&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305236 Stats: 24 lines in 2 files changed: 0 ins; 24 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13244.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13244/head:pull/13244 PR: https://git.openjdk.org/jdk/pull/13244 From qamai at openjdk.org Thu Mar 30 14:30:17 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 30 Mar 2023 14:30:17 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] Message-ID: > Hi, > > This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: > > 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. > 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. > 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. > 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones.
> > Upon these changes, a `rearrange` can emit more efficient code: > > var species = IntVector.SPECIES_128; > var v1 = IntVector.fromArray(species, SRC1, 0); > var v2 = IntVector.fromArray(species, SRC2, 0); > v1.rearrange(v2.toShuffle()).intoArray(DST, 0); > > Before: > movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} > vmovdqu 0x10(%r10),%xmm2 > movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} > vmovdqu 0x10(%r10),%xmm0 > vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask > ; {external_word} > vpackusdw %xmm0,%xmm0,%xmm0 > vpackuswb %xmm0,%xmm0,%xmm0 > vpmovsxbd %xmm0,%xmm3 > vpcmpgtd %xmm3,%xmm1,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fc2acb4e0d8 > vpmovzxbd %xmm0,%xmm0 > vpermd %ymm2,%ymm0,%ymm0 > movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} > vmovdqu %xmm0,0x10(%r10) > > After: > movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} > vmovdqu 0x10(%r10),%xmm2 > vpxor %xmm0,%xmm0,%xmm0 > vpcmpgtd %xmm2,%xmm0,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fa818b27cb1 > vpermd %ymm1,%ymm2,%ymm0 > movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} > vmovdqu %xmm0,0x10(%r10) > > Please take a look and leave reviews. Thanks a lot. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: - move implementations up - Merge branch 'master' into shufflerefactor - Merge branch 'master' into shufflerefactor - reviews - missing casts - clean up - fix Matcher::vector_needs_load_shuffle - fix internal types, clean up - optimise laneIsValid - Merge branch 'master' into shufflerefactor - ... and 4 more: https://git.openjdk.org/jdk/compare/d063b896...a4835c00 ------------- Changes: https://git.openjdk.org/jdk/pull/13093/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13093&range=04 Stats: 3683 lines in 64 files changed: 1610 ins; 1169 del; 904 mod Patch: https://git.openjdk.org/jdk/pull/13093.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13093/head:pull/13093 PR: https://git.openjdk.org/jdk/pull/13093 From qamai at openjdk.org Thu Mar 30 14:34:34 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 30 Mar 2023 14:34:34 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 07:13:55 GMT, Xiaohong Gong wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: >> >> - move implementations up >> - Merge branch 'master' into shufflerefactor >> - Merge branch 'master' into shufflerefactor >> - reviews >> - missing casts >> - clean up >> - fix Matcher::vector_needs_load_shuffle >> - fix internal types, clean up >> - optimise laneIsValid >> - Merge branch 'master' into shufflerefactor >> - ... and 4 more: https://git.openjdk.org/jdk/compare/d063b896...a4835c00 > > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java line 868: > >> 866: return (Byte128Vector) Byte128Vector.VSPECIES.dummyVector() >> 867: .vectorFactory(s.indices()); >> 868: } > > Move the implementation details to the super class? I have moved `toBitsVectorTemplate` to `AbstractShuffle`, also `toShuffle` has been refactored to be more versatile. 
> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteMaxVector.java line 862: > >> 860: v.convertShape(VectorOperators.B2I, species, 3) >> 861: .reinterpretAsInts() >> 862: .intoArray(a, offset + species.length() * 3); > > Can we add a method like `intoIntArray()` in `ByteVector` and move these common code there? The same to other vector types. I don't think this and the below suggestion achieve much, reducing the generated code at the expense of moving the usage away from the definition does not seem worth it to me. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1153353267 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1153354836 From qamai at openjdk.org Thu Mar 30 14:44:36 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 30 Mar 2023 14:44:36 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: On Thu, 23 Mar 2023 02:23:20 GMT, Xiaohong Gong wrote: >> I think not emitting `VectorLoadShuffleNode` is more common so it is better to emit them only when needed, as it will simplify the graph and may allow better inspections of the indices in the future. Additionally, a do-nothing node does not alias with its input and therefore kills the input, which leads to an additional spill if they both need to live. > > Yeah, I agree that saving a node have some benefits like what you said. My concern is there are more and more methods added into `Matcher::` and each platform has to do the different implementation. There is not too much meaning for those platforms that do not implement Vector API like` arm/ppc/...` for me. This makes code not so easy to maintain. I agree, I am thinking of (ab)using template to have a common query function like this template T vectorQuery(U... args) { return T(); } then each platform can have a specialisation that looks like this: template <> bool vectorQuery(BasicType bt, int vlen) { } Unimplemented platform will return `false` for this, what do you think? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1153370540 From jcking at openjdk.org Thu Mar 30 15:03:31 2023 From: jcking at openjdk.org (Justin King) Date: Thu, 30 Mar 2023 15:03:31 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v3] In-Reply-To: References: Message-ID: <6eaR42oiyEwMmYWPU0xW8IBBNuMcSkZHz9fzXbvOQdo=.77d5c8b7-cfea-4250-9964-9e1ec974f776@github.com> On Wed, 29 Mar 2023 16:36:17 GMT, Vladimir Kempik wrote: >> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. >> >> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. > > Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: > > Reduce code duplication `UseUnalignedAccesses` is a flag, no? Doesn't this now require checking to see if its aligned or the flag is true? So for aligned accesses the speed is the same, for unaligned it is now slower as it has to look at the flag first, which is likely somewhere else in memory, forcing cache lines to be flushed. 
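For concreteness, here is a minimal standalone sketch of the kind of guarded store being discussed (hypothetical names only - `unaligned_ok` stands in for the `UseUnalignedAccesses` flag - and not the actual patch):

    #include <cstdint>
    #include <cstring>

    // Sketch: emit a 32-bit value into the code buffer at p.
    inline void put_int32_sketch(unsigned char* p, int32_t x, bool unaligned_ok) {
        if (unaligned_ok || (reinterpret_cast<uintptr_t>(p) % sizeof(x)) == 0) {
            // Aligned slot, or the CPU handles misaligned stores cheaply:
            // a plain store, mirroring the existing emit path.
            *reinterpret_cast<int32_t*>(p) = x;
        } else {
            // Strict-alignment target with a possibly misaligned slot:
            // a fixed-size memcpy, which the compiler lowers to safe stores.
            std::memcpy(p, &x, sizeof(x));
        }
    }

A later comment in this thread suggests dropping the branch entirely and calling `memcpy` unconditionally, letting the compiler pick the best instruction sequence for the target.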
------------- PR Comment: https://git.openjdk.org/jdk/pull/13227#issuecomment-1490459477 From phh at openjdk.org Thu Mar 30 15:18:47 2023 From: phh at openjdk.org (Paul Hohensee) Date: Thu, 30 Mar 2023 15:18:47 GMT Subject: RFR: 8305142: Can't bootstrap ctw.jar In-Reply-To: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> References: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> Message-ID: On Wed, 29 Mar 2023 07:17:23 GMT, Xin Liu wrote: > This patch add a few add-exports so CTW can access those internal packages of java.base module. > > make succeeds and ctw.jar is generated as expected. Lgtm. ------------- Marked as reviewed by phh (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13220#pullrequestreview-1365370298 From stuefe at openjdk.org Thu Mar 30 15:18:49 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 30 Mar 2023 15:18:49 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v3] In-Reply-To: References: Message-ID: <4iSP7cQzd1cs5DRHTEmFDmyLXwCkkmm-O504keQPuZw=.723ab392-e59b-4c49-a43c-9b4690298de5@github.com> On Wed, 29 Mar 2023 16:36:17 GMT, Vladimir Kempik wrote: >> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. >> >> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. > > Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: > > Reduce code duplication Sorry if I'm slow, but I do not understand this patch, and the JBS issue text does not help much. Why would the existing test for alignment not be sufficient to prevent unaligned access? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13227#issuecomment-1490487237 From kvn at openjdk.org Thu Mar 30 16:29:28 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Mar 2023 16:29:28 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v6] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 07:26:54 GMT, Emanuel Peter wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. >> >> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. >> >> **The Idea of VerifyLoopOptimizations** >> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. >> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. >> >> **My Approach** >> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. 
In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. >> >> **What I fixed** >> >> - `verify_compare` >> - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). >> - Previously, it was implemented as a BFS with recursion, which lead to stack-overflow. I flattened the BFS into a loop. >> - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable. >> - I now report all failures, before asserting. >> - `verify_tree` >> - I corrected the style and improved comments. >> - I removed the broken verification for `Opaque` nodes. I added some rudamentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. >> - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`. >> >> **Disabled Verifications** >> I commented out the following verifications: >> >> (A) data nodes should have same ctrl >> (B) ctrl node should belong to same loop >> (C) ctrl node should have same idom >> (D) loop should have same tail >> (E) loop should have same body (list of nodes) >> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong >> >> >> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. >> >> **Follow-Up Work** >> >> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. >> >> I propose the following order: >> >> - idom (C): The dominance structure is at the base of everything else. >> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. >> - tail (D): ensure the tail of a loop is updated correctly >> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. >> - other issues like (F) >> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. >> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. >> >> **Testing** >> I am running `tier1-tier6` and stress testing. >> Preliminary results are all good. >> >> **Conclusion** >> With this fix, I have the basic infrastructure of the verification working. >> However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. >> Follow-up RFE's will have to address these one-by-one. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fix typo Looks good. now. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13207#pullrequestreview-1365499725 From kvn at openjdk.org Thu Mar 30 16:29:31 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Mar 2023 16:29:31 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v4] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 06:31:31 GMT, Emanuel Peter wrote: >> As the comment here says, infinite loops are not attached the first time we run `build_loop_tree`. 
>> But from the second time on, we have the `NeverBranch` added, so it has a "fake" exit, and is not really an infinite loop any more, so it will always be found and attached to the loop tree. >> `this` runs before `loop_verify`, so it is possible that `child` does not have the loop, but `child_verify` finds it. It is not possible the other way around. >> I can add a comment to the code for that. > > Side note: > Before my patch, we did not check if `loop_verify` had any loops that we did not have in `this`. That is why we did not have to deal with suddenly appearing infinite loops. Yes, please add comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1153505177 From rcastanedalo at openjdk.org Thu Mar 30 16:30:35 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 30 Mar 2023 16:30:35 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v5] In-Reply-To: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: > The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: > > - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and > - "Condense graph", which makes the graph more compact without loss of information. > > Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): > > ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) > > Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: > - combining Bool and conversion nodes into their predecessors, > - inlining all Parm nodes except control into their successors (this removes lots of long edges), > - removing "top" inputs from call-like nodes, > - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, > - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and > - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). > > The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. 
This helps reducing node width, which is known to improve overall layout quality: > > ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) > > Note that the exact input indices can still be retrieved via the incoming edge's tooltips: > > ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) > > The control-flow graph view is also adapted to this representation: > > ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) > > #### Additional improvements > > Additionally, this changeset introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information; and defines and documents JavaScript helpers to simplify the new and existing available filters. Here is an example of the effect of the new "Show custom node info" filter: > > ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) > > ### Testing > > #### Functionality > > - Tested the functionality manually on a small selection of graphs. > > - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > #### Performance > > Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. > > The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 53 commits: - Merge branch 'master' into JDK-8302738 - Partially revert "Split simplify graph filter into two, ensure they are applied in right order" This reverts parts of commit 07621a8012c925eaa612ef8b35611557f9f0f4ca, as they have been contributed independently to mainline. - Fix comment typo - Revert "Select slots as well" This reverts commit 8256f0c20d7747cda691291c47841a9280d8c493. Revert "Fix figure selection" This reverts commit 71e73e89facfbb31614e4f1f3676c9e91a38e01a. Revert "Make slots searchable and selectable" This reverts commit 69cbec1f24ec5e941a5a72ab94b79117551d9560. - Increase the bold text line factor slightly - Add extra horizontal margin for long labels and let them overflow within the node - Select slots as well - Remove code that is commented out - Assert inputLabel is non-null - Document filter helpers - ... 
and 43 more: https://git.openjdk.org/jdk/compare/9df20600...6e6a8ecc ------------- Changes: https://git.openjdk.org/jdk/pull/12955/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12955&range=04 Stats: 877 lines in 37 files changed: 537 ins; 207 del; 133 mod Patch: https://git.openjdk.org/jdk/pull/12955.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12955/head:pull/12955 PR: https://git.openjdk.org/jdk/pull/12955 From qamai at openjdk.org Thu Mar 30 16:32:01 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 30 Mar 2023 16:32:01 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v3] In-Reply-To: References: Message-ID: <6xESHmK3740UCxCW9YpqxH8qg5mwR6GnBqyK8s5baAA=.784e556f-918e-4cf4-a92a-5159083d928c@github.com> On Wed, 29 Mar 2023 16:36:17 GMT, Vladimir Kempik wrote: >> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. >> >> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. > > Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: > > Reduce code duplication It would probably be more efficient if you just use `memcpy` and let the compiler figure out the best method to do memory accesses. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13227#issuecomment-1490593420 From kvn at openjdk.org Thu Mar 30 16:32:01 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Mar 2023 16:32:01 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies [v2] In-Reply-To: References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: On Thu, 30 Mar 2023 06:39:11 GMT, Emanuel Peter wrote: > > Do you know if this affect any our existing vector tests? > > @vladimir Thanks for the review. Yes. I had a run where I assert if I find cycles. I ran it up to tier5 and stress testing. And the assert was never triggered, except in the two regression tests that I added (there it triggered a lot). So I think it really has no effect, except the extra runtime. Perfect! Thank you for doing it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13078#issuecomment-1490592865 From rcastanedalo at openjdk.org Thu Mar 30 16:35:37 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 30 Mar 2023 16:35:37 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v2] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Tue, 28 Mar 2023 08:42:53 GMT, Christian Hagedorn wrote: >> Roberto Casta?eda Lozano has updated the pull request incrementally with five additional commits since the last revision: >> >> - Increase the bold text line factor slightly >> - Add extra horizontal margin for long labels and let them overflow within the node >> - Select slots as well >> - Remove code that is commented out >> - Assert inputLabel is non-null > > Thanks Roberto for your detailed answers! > >> > When selecting a CallStaticJava node, the custom node info is sometimes cut depending on the zoom level (sometimes more, sometimes less) >> >> Good catch, @chhagedorn! This is an existing issue in mainline IGV, you can reproduce it e.g. by showing a long property such as `dump_spec` in the node text. 
The issue just becomes more visible with the addition of custom node info in this changeset. As far as I understand, the node width is computed assuming it is selected (i.e. bold text) at 100% zoom level, and scaled proportionally to the selected zoom level. This assumes label fonts scale perfectly with the zoom level, which is not the case. As a result, very long node labels can overflow at different zoom levels than 100%. I don't see a better solution than multiplying the computed node width with a factor (`Figure::BOLD_LINE_FACTOR`) to account for the worst-case text overflow at any zoom level. This will not change the width of most nodes since this tends to be dominated by the input slots anyway, only for those nodes with long labels. I selected this factor experimentally to be of 6% of the total width. Hope this new ver sion fixes the issue you observed. If not, please try out and suggest a more appropriate factor. > > That looks much better now! 6% seems to be a good value to go with. Thanks for fixing this general issue. > >> > Selecting an inlined node with the condensed graph filter does not work when searching for it. For example, I can search for 165 Bool node in the search field. It finds it but when clicking on it, it shows me an empty graph. I would have expected to see the following graph with the "outer" node being selected which includes 165 Bool. >> >> Selecting, highlighting, centering, synchronizing etc. inlined and combined nodes ("slots" in IGV speak) has not been possible at all before this changeset. You can reproduce similar issues when using the "Simplify graph" filter in mainline IGV. > > I see, I've never used the "Simplify graph" filter before. That's why I've only noticed this now. > >> I included some basic (admittedly half-baked) support for this in this changeset (enhanced searching and parts of selecting, but not highlighting, centering, or synchronizing among tabs), but implementing full support would require a rather deep refactoring of IGV. I will not have time to work on such a refactoring in the coming weeks, so I propose to simply remove the partial support for slot interaction implemented provided by this changeset, so that we leave IGV in the same consistent state as before, and create a RFE for adding proper support in the future. @chhagedorn, @tobiasholenstein what do you think? > > I agree with your suggestion to remove the partial implementation and try to fully support it later in a separate RFE. That might be the cleanest solution for now. And we could still take your current code as a starting point for that RFE . > >> > Maybe the node info can be improved further in a future RFE, for example for CountedLoop nodes to also show if it is a pre/main/post loop or to add the stride. >> >> Good suggestion! I agree that there is room for further exploiting custom node info in the future, loop nodes are excellent candidates :) > > Great! :-) Can you file an RFE for that? > > Thanks, > Christian I just reverted the filter ordering changes from this changeset and merged from master, incorporating the changes from JDK-8302644 instead ("IGV: Apply filters per graph tab and not globally"). Testing passes as before. @chhagedorn, @tobiasholenstein would you like to have another look or am I free to integrate? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1490597196 From vkempik at openjdk.org Thu Mar 30 16:52:13 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 30 Mar 2023 16:52:13 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v3] In-Reply-To: <4iSP7cQzd1cs5DRHTEmFDmyLXwCkkmm-O504keQPuZw=.723ab392-e59b-4c49-a43c-9b4690298de5@github.com> References: <4iSP7cQzd1cs5DRHTEmFDmyLXwCkkmm-O504keQPuZw=.723ab392-e59b-4c49-a43c-9b4690298de5@github.com> Message-ID: <6g6jSo3ngSBhalwL_-2r-9ysfrdDUUZIyzo-Q4tHvcs=.7ba9d6a3-750b-4162-8c79-42d71109f348@github.com> On Thu, 30 Mar 2023 15:16:14 GMT, Thomas Stuefe wrote: > Sorry if I'm slow, but I do not understand this patch, and the JBS issue text does not help much. Why would the existing test for alignment not be sufficient to prevent unaligned access? Hello. Currently, emit_intXX performs unaligned accesses without any check of the existing flags regarding the status of unaligned access. This is especially bad (performance-wise) on platforms where misaligned access is emulated (e.g. some risc-v boards). The idea of the patch is to use put_native_uX instead of a direct pointer dereference. But then I need to adjust some existing put_native_uX code to respect the UseUnalignedAccesses flag. As for the speed of the new put_native after adding the flag check - that flag is effectively constant during the lifetime of the JVM and should be easy for any branch predictor, resulting in pretty low overhead. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13227#issuecomment-1490619497 From xliu at openjdk.org Thu Mar 30 16:57:26 2023 From: xliu at openjdk.org (Xin Liu) Date: Thu, 30 Mar 2023 16:57:26 GMT Subject: Integrated: 8305142: Can't bootstrap ctw.jar In-Reply-To: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> References: <_KrHiSvoOhisFwY9VvBouTIqYycK8YPpPFwb9BP5XHE=.c60cc690-c7e6-42b0-9010-f080c6981ace@github.com> Message-ID: On Wed, 29 Mar 2023 07:17:23 GMT, Xin Liu wrote: > This patch adds a few add-exports so CTW can access those internal packages of the java.base module. > > make succeeds and ctw.jar is generated as expected. This pull request has now been integrated. Changeset: 83cf28f9 Author: Xin Liu URL: https://git.openjdk.org/jdk/commit/83cf28f99639d80e62c4031c4c9752460de5f36c Stats: 6 lines in 1 file changed: 5 ins; 0 del; 1 mod 8305142: Can't bootstrap ctw.jar Reviewed-by: shade, phh ------------- PR: https://git.openjdk.org/jdk/pull/13220 From chagedorn at openjdk.org Thu Mar 30 20:13:25 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 30 Mar 2023 20:13:25 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v5] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Thu, 30 Mar 2023 16:30:35 GMT, Roberto Castañeda Lozano wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information.
>> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). >> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information; and defines and documents JavaScript helpers to simplify the new and existing available filters. Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. >> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. 
>> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 53 commits: > > - Merge branch 'master' into JDK-8302738 > - Partially revert "Split simplify graph filter into two, ensure they are applied in right order" > > This reverts parts of commit 07621a8012c925eaa612ef8b35611557f9f0f4ca, as > they have been contributed independently to mainline. > - Fix comment typo > - Revert "Select slots as well" > > This reverts commit 8256f0c20d7747cda691291c47841a9280d8c493. > > Revert "Fix figure selection" > > This reverts commit 71e73e89facfbb31614e4f1f3676c9e91a38e01a. > > Revert "Make slots searchable and selectable" > > This reverts commit 69cbec1f24ec5e941a5a72ab94b79117551d9560. > - Increase the bold text line factor slightly > - Add extra horizontal margin for long labels and let them overflow within the node > - Select slots as well > - Remove code that is commented out > - Assert inputLabel is non-null > - Document filter helpers > - ... and 43 more: https://git.openjdk.org/jdk/compare/9df20600...6e6a8ecc That looks good to go in! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/12955#pullrequestreview-1365853670 From cslucas at openjdk.org Thu Mar 30 23:36:20 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Thu, 30 Mar 2023 23:36:20 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5] In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: > Can I please get reviews for this PR? > > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. > > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. > > The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. 
> > The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. > > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12897/files - new: https://git.openjdk.org/jdk/pull/12897/files/a158ae66..5ef86371 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=03-04 Stats: 552 lines in 18 files changed: 181 ins; 169 del; 202 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From psandoz at openjdk.org Fri Mar 31 00:01:22 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Fri, 31 Mar 2023 00:01:22 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 14:30:17 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. 
>> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - move implementations up > - Merge branch 'master' into shufflerefactor > - Merge branch 'master' into shufflerefactor > - reviews > - missing casts > - clean up > - fix Matcher::vector_needs_load_shuffle > - fix internal types, clean up > - optimise laneIsValid > - Merge branch 'master' into shufflerefactor > - ... and 4 more: https://git.openjdk.org/jdk/compare/d063b896...a4835c00 src/hotspot/cpu/x86/x86.ad line 2174: > 2172: } > 2173: > 2174: // Do Vector::rearrange needs preparation of the shuffle argument Suggestion: // Returns true if Vector::rearrange needs preparation of the shuffle argument ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1153894269 From psandoz at openjdk.org Fri Mar 31 00:10:29 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Fri, 31 Mar 2023 00:10:29 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 14:30:17 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. 
>> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - move implementations up > - Merge branch 'master' into shufflerefactor > - Merge branch 'master' into shufflerefactor > - reviews > - missing casts > - clean up > - fix Matcher::vector_needs_load_shuffle > - fix internal types, clean up > - optimise laneIsValid > - Merge branch 'master' into shufflerefactor > - ... and 4 more: https://git.openjdk.org/jdk/compare/d063b896...a4835c00 src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractVector.java line 204: > 202: dvtype, dtype, dlength, > 203: this, dsp, > 204: AbstractVector::toShuffle0); Suggestion: return VectorSupport.convert(VectorSupport.VECTOR_OP_CAST, getClass(), etype, length(), dvtype, dtype, dlength, this, dsp, AbstractVector::toShuffle0); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1153897796 From psandoz at openjdk.org Fri Mar 31 00:21:22 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Fri, 31 Mar 2023 00:21:22 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 14:30:17 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. 
Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. >> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - move implementations up > - Merge branch 'master' into shufflerefactor > - Merge branch 'master' into shufflerefactor > - reviews > - missing casts > - clean up > - fix Matcher::vector_needs_load_shuffle > - fix internal types, clean up > - optimise laneIsValid > - Merge branch 'master' into shufflerefactor > - ... and 4 more: https://git.openjdk.org/jdk/compare/d063b896...a4835c00 src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template line 1106: > 1104: @Override > 1105: @ForceInline > 1106: public int laneSource(int i) { Can this method be moved to `AbstractShuffle`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1153902133 From psandoz at openjdk.org Fri Mar 31 00:29:22 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Fri, 31 Mar 2023 00:29:22 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 14:30:17 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. 
Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. >> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - move implementations up > - Merge branch 'master' into shufflerefactor > - Merge branch 'master' into shufflerefactor > - reviews > - missing casts > - clean up > - fix Matcher::vector_needs_load_shuffle > - fix internal types, clean up > - optimise laneIsValid > - Merge branch 'master' into shufflerefactor > - ... and 4 more: https://git.openjdk.org/jdk/compare/d063b896...a4835c00 The changes look very good to me. It would be useful if @jatin-bhateja could also take a look. src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template line 1158: > 1156: } > 1157: > 1158: private static $bitstype$[] prepare(int[] indices, int offset) { If we want to reduce code duplication further I suspect we could move these static methods to IntVector etc. Up to you. ------------- Marked as reviewed by psandoz (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13093#pullrequestreview-1366098245 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1153904271 From eliu at openjdk.org Fri Mar 31 01:06:19 2023 From: eliu at openjdk.org (Eric Liu) Date: Fri, 31 Mar 2023 01:06:19 GMT Subject: RFR: JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name [v2] In-Reply-To: References: Message-ID: <7lgPpLklaKAXajJZQrMKb2PCMxg2TU_eKmGT9OdRMeU=.80eee11e-22b7-4b9b-975f-5f4f54c3e83d@github.com> On Thu, 30 Mar 2023 10:24:06 GMT, Tobias Holenstein wrote: >> Graphs in IGV that were osr compiled have an "osr" property. To make it easier to distinguish them from non-osr, append `[OSR]` to the name of the group in the outline >> >> osr > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > Update Group.java > > update copyright year Marked as reviewed by eliu (Committer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/13241#pullrequestreview-1366117989 From xgong at openjdk.org Fri Mar 31 01:31:30 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 31 Mar 2023 01:31:30 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: <4Op0Z8whnyDXDC6zGyMbx4ugcZp5TEoAqW_myB5flxM=.1c7b59ba-efb2-4f68-90d7-2d6e33e39572@github.com> Message-ID: <85ZXqzxAoKbsQrQdxENyOxLI_t5BXW5skPnWEHxcjd0=.7cb7e9f7-8049-4b23-938f-041c8e45b0ec@github.com> On Thu, 30 Mar 2023 14:41:29 GMT, Quan Anh Mai wrote: >> Yeah, I agree that saving a node has some benefits like what you said. My concern is that more and more methods are added into `Matcher::` and each platform has to provide a different implementation. There is not too much meaning for those platforms that do not implement the Vector API like `arm/ppc/...` for me. This makes the code not so easy to maintain. > > I agree, I am thinking of (ab)using a template to have a common query function like this > > template <class T, class... U> > T vectorQuery(U... args) { > return T(); > } > > then each platform can have a specialisation that looks like this: > > template <> > bool vectorQuery(BasicType bt, int vlen) { > } > > Unimplemented platforms will return `false` for this, what do you think? Sounds good! Maybe we can have a follow-up patch to see what the final code looks like. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1153928312 From xgong at openjdk.org Fri Mar 31 03:13:22 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 31 Mar 2023 03:13:22 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 14:30:17 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones.
>> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - move implementations up > - Merge branch 'master' into shufflerefactor > - Merge branch 'master' into shufflerefactor > - reviews > - missing casts > - clean up > - fix Matcher::vector_needs_load_shuffle > - fix internal types, clean up > - optimise laneIsValid > - Merge branch 'master' into shufflerefactor > - ... and 4 more: https://git.openjdk.org/jdk/compare/d063b896...a4835c00 Vector API tests pass on AArch64 platforms (NEON & SVE). So looks good to me! Please do not forget to update the copyright for two additional touched files `AbstractSpecies.java` and `VectorSpecies.java`. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1491229216 From fgao at openjdk.org Fri Mar 31 06:17:17 2023 From: fgao at openjdk.org (Fei Gao) Date: Fri, 31 Mar 2023 06:17:17 GMT Subject: RFR: 8305055: IR check fails on some aarch64 platforms [v2] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 08:44:06 GMT, Fei Gao wrote: >> As @eme64 said in [1], [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) introduced some "collateral damage", disabling the vectorization of some conversions when `+AlignVector`. That affects IR checks of `TestVectorizeTypeConversion.java` and `ArrayTypeConvertTest.java` on some `aarch64` platforms like ThunderX and ThunderX2 [2]. >> >> This trivial patch is to allow IR check only when we have `-AlignVector`. >> >> [1] https://github.com/openjdk/jdk/pull/12350#issuecomment-1470065706 >> [2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154 > > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Update comments The GHA failure of `runtime/ErrorHandling/TestDwarf` is not related to this PR. I'll integrate it. 
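As a concrete illustration of the kind of loop whose IR check is affected here: the jtreg tests assert that SuperWord vectorizes conversion loops shaped like the sketch below. The class, array names and sizes are invented for this example and are not taken from TestVectorizeTypeConversion.java or ArrayTypeConvertTest.java; with +AlignVector the conversion may stay scalar on the affected aarch64 platforms, which is why the IR rules are now applied only with -AlignVector.

// Illustrative only: a type-conversion loop of the shape the IR tests check.
// Names and the iteration count are assumptions made for this sketch.
public class ConvSketch {
    static final int N = 1024;
    static int[]    src = new int[N];
    static double[] dst = new double[N];

    // SuperWord can vectorize this int -> double conversion; whether it does
    // so under +AlignVector is exactly what the relaxed IR rules account for.
    static void convert() {
        for (int i = 0; i < N; i++) {
            dst[i] = (double) src[i];
        }
    }
}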
------------- PR Comment: https://git.openjdk.org/jdk/pull/13236#issuecomment-1491354506 From amitkumar at openjdk.org Fri Mar 31 07:15:15 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 31 Mar 2023 07:15:15 GMT Subject: RFR: 8305227: [s390x] build broken after JDK-8231349 Message-ID: This PR moves nmethod entry barrier from `generate_compiler_stubs()` to `generate_final_stubs()`. Test build for fastdebug, slow debug, release and optimised. Tier1 test in fastdebug seems clean as well. ------------- Commit messages: - moves _nmethod_entry_barrier Changes: https://git.openjdk.org/jdk/pull/13259/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13259&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305227 Stats: 13 lines in 1 file changed: 7 ins; 6 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13259.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13259/head:pull/13259 PR: https://git.openjdk.org/jdk/pull/13259 From amitkumar at openjdk.org Fri Mar 31 07:15:15 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 31 Mar 2023 07:15:15 GMT Subject: RFR: 8305227: [s390x] build broken after JDK-8231349 In-Reply-To: References: Message-ID: <4nV0H8zZ2eUlOUTv_S5bWQI-m_jIFaJqCyFVhdgR8RM=.03f76534-3763-42b7-b578-68a0f4e68d48@github.com> On Fri, 31 Mar 2023 07:06:25 GMT, Amit Kumar wrote: > This PR moves nmethod entry barrier from `generate_compiler_stubs()` to `generate_final_stubs()`. Test build for fastdebug, slow debug, release and optimised. Tier1 test in fastdebug seems clean as well. Hi @vnkozlov , @RealLucy Please do review this PR. Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13259#issuecomment-1491422833 From jbhateja at openjdk.org Fri Mar 31 07:37:13 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 31 Mar 2023 07:37:13 GMT Subject: RFR: 8302673: [SuperWord] MaxReduction and MinReduction should vectorize for int Message-ID: This bugfix patch bypasses couple of canonicalizing ideal transformations for MaxI/MinI IR nodes to prevent breaking reduction chain. Kindly review. Best Regards, Jatin ------------- Commit messages: - 8302673: [SuperWord] MaxReduction and MinReduction should vectorize for int Changes: https://git.openjdk.org/jdk/pull/13260/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13260&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8302673 Stats: 124 lines in 3 files changed: 120 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/13260.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13260/head:pull/13260 PR: https://git.openjdk.org/jdk/pull/13260 From shade at openjdk.org Fri Mar 31 07:45:14 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 31 Mar 2023 07:45:14 GMT Subject: RFR: 8305227: [s390x] build broken after JDK-8231349 In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 07:06:25 GMT, Amit Kumar wrote: > This PR moves nmethod entry barrier from `generate_compiler_stubs()` to `generate_final_stubs()`. Test build for fastdebug, slow debug, release and optimised. Tier1 test in fastdebug seems clean as well. Looks fine to me. ------------- Marked as reviewed by shade (Reviewer). 
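Returning to the MaxReduction/MinReduction change (8302673) described above: the pattern it is concerned with is an int min/max reduction loop, roughly like the sketch below. This is only an illustration of the loop shape; it is not code from the patch or its tests.

// Invented example of an int max-reduction loop. The point of 8302673 is
// that certain MaxI/MinI canonicalizations can reorder this chain so that
// SuperWord no longer recognizes it as a reduction and fails to vectorize.
public class MaxReductionSketch {
    static final int N = 1024;
    static int[] a = new int[N];

    static int maxOfArray() {
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < N; i++) {
            max = Math.max(max, a[i]);
        }
        return max;
    }
}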
PR Review: https://git.openjdk.org/jdk/pull/13259#pullrequestreview-1366418701 From rcastanedalo at openjdk.org Fri Mar 31 09:01:13 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 31 Mar 2023 09:01:13 GMT Subject: RFR: 8302673: [SuperWord] MaxReduction and MinReduction should vectorize for int In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 07:30:30 GMT, Jatin Bhateja wrote: > This bugfix patch bypasses couple of canonicalizing ideal transformations for MaxI/MinI IR nodes to prevent breaking reduction chain. > > Kindly review. > > Best Regards, > Jatin Hi @jatin-bhateja, this changeset is closely related to JDK-8287087, which proposes removing reduction flags and doing reduction analysis on-demand. JDK-8287087 is currently [out for review](https://github.com/openjdk/jdk/pull/13120), would be great to know what you think about it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13260#issuecomment-1491569562 From fgao at openjdk.org Fri Mar 31 09:18:27 2023 From: fgao at openjdk.org (Fei Gao) Date: Fri, 31 Mar 2023 09:18:27 GMT Subject: Integrated: 8305055: IR check fails on some aarch64 platforms In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 04:13:39 GMT, Fei Gao wrote: > As @eme64 said in [1], [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) introduced some "collateral damage", disabling the vectorization of some conversions when `+AlignVector`. That affects IR checks of `TestVectorizeTypeConversion.java` and `ArrayTypeConvertTest.java` on some `aarch64` platforms like ThunderX and ThunderX2 [2]. > > This trivial patch is to allow IR check only when we have `-AlignVector`. > > [1] https://github.com/openjdk/jdk/pull/12350#issuecomment-1470065706 > [2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154 This pull request has now been integrated. Changeset: dea9db2d Author: Fei Gao URL: https://git.openjdk.org/jdk/commit/dea9db2d0a28b379303ce867df6b125f5fdfcf16 Stats: 14 lines in 2 files changed: 12 ins; 0 del; 2 mod 8305055: IR check fails on some aarch64 platforms Reviewed-by: epeter, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13236 From lucy at openjdk.org Fri Mar 31 11:08:15 2023 From: lucy at openjdk.org (Lutz Schmidt) Date: Fri, 31 Mar 2023 11:08:15 GMT Subject: RFR: 8305227: [s390x] build broken after JDK-8231349 In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 07:06:25 GMT, Amit Kumar wrote: > This PR moves nmethod entry barrier from `generate_compiler_stubs()` to `generate_final_stubs()`. Test build for fastdebug, slow debug, release and optimised. Tier1 test in fastdebug seems clean as well. LGTM and resolves the crash observed. @offamitkumar Thanks for fixing! @vnkozlov Thanks for helping out! ------------- Marked as reviewed by lucy (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13259#pullrequestreview-1366733435 From lucy at openjdk.org Fri Mar 31 11:27:18 2023 From: lucy at openjdk.org (Lutz Schmidt) Date: Fri, 31 Mar 2023 11:27:18 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v2] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Fri, 17 Mar 2023 08:37:28 GMT, SUN Guoyun wrote: >> Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: >> >> use constant instead of enum > > I don't think the current modification is reasonable, why don't you modify `emit_typecheck_helper`? > Maybe we can also @dean-long has to say about this. @sunny868 @dean-long You are right. The first approach was a band-aid at best. With more knowledge, it turns out the assert condition is incorrect. You have to consider a potentially non-zero FrameMap::first_available_sp_in_frame(). I expect @offamitkumar will come up with a suitable fix real soon. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12825#issuecomment-1491772237 From tholenstein at openjdk.org Fri Mar 31 11:56:22 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 31 Mar 2023 11:56:22 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v5] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Thu, 30 Mar 2023 16:30:35 GMT, Roberto Casta?eda Lozano wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information. >> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). >> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. 
This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information; and defines and documents JavaScript helpers to simplify the new and existing available filters. Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. >> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. >> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 53 commits: > > - Merge branch 'master' into JDK-8302738 > - Partially revert "Split simplify graph filter into two, ensure they are applied in right order" > > This reverts parts of commit 07621a8012c925eaa612ef8b35611557f9f0f4ca, as > they have been contributed independently to mainline. > - Fix comment typo > - Revert "Select slots as well" > > This reverts commit 8256f0c20d7747cda691291c47841a9280d8c493. > > Revert "Fix figure selection" > > This reverts commit 71e73e89facfbb31614e4f1f3676c9e91a38e01a. > > Revert "Make slots searchable and selectable" > > This reverts commit 69cbec1f24ec5e941a5a72ab94b79117551d9560. > - Increase the bold text line factor slightly > - Add extra horizontal margin for long labels and let them overflow within the node > - Select slots as well > - Remove code that is commented out > - Assert inputLabel is non-null > - Document filter helpers > - ... 
and 43 more: https://git.openjdk.org/jdk/compare/9df20600...6e6a8ecc approved looks good - go ahead ------------- Marked as reviewed by tholenstein (Committer). PR Review: https://git.openjdk.org/jdk/pull/12955#pullrequestreview-1366797441 PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1491812317 From rcastanedalo at openjdk.org Fri Mar 31 12:07:42 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 31 Mar 2023 12:07:42 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v5] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Thu, 30 Mar 2023 16:30:35 GMT, Roberto Casta?eda Lozano wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information. >> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). >> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information; and defines and documents JavaScript helpers to simplify the new and existing available filters. Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. 
>> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. >> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 53 commits: > > - Merge branch 'master' into JDK-8302738 > - Partially revert "Split simplify graph filter into two, ensure they are applied in right order" > > This reverts parts of commit 07621a8012c925eaa612ef8b35611557f9f0f4ca, as > they have been contributed independently to mainline. > - Fix comment typo > - Revert "Select slots as well" > > This reverts commit 8256f0c20d7747cda691291c47841a9280d8c493. > > Revert "Fix figure selection" > > This reverts commit 71e73e89facfbb31614e4f1f3676c9e91a38e01a. > > Revert "Make slots searchable and selectable" > > This reverts commit 69cbec1f24ec5e941a5a72ab94b79117551d9560. > - Increase the bold text line factor slightly > - Add extra horizontal margin for long labels and let them overflow within the node > - Select slots as well > - Remove code that is commented out > - Assert inputLabel is non-null > - Document filter helpers > - ... and 43 more: https://git.openjdk.org/jdk/compare/9df20600...6e6a8ecc Thanks for reviewing Christian and Toby! ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1491823451 From rcastanedalo at openjdk.org Fri Mar 31 12:07:44 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 31 Mar 2023 12:07:44 GMT Subject: Integrated: 8302738: IGV: refine 'Simplify graph' filter In-Reply-To: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Thu, 9 Mar 2023 18:28:02 GMT, Roberto Casta?eda Lozano wrote: > The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: > > - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and > - "Condense graph", which makes the graph more compact without loss of information. 
> > Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): > > ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) > > Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: > - combining Bool and conversion nodes into their predecessors, > - inlining all Parm nodes except control into their successors (this removes lots of long edges), > - removing "top" inputs from call-like nodes, > - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, > - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and > - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). > > The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: > > ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) > > Note that the exact input indices can still be retrieved via the incoming edge's tooltips: > > ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) > > The control-flow graph view is also adapted to this representation: > > ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) > > #### Additional improvements > > Additionally, this changeset introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information; and defines and documents JavaScript helpers to simplify the new and existing available filters. Here is an example of the effect of the new "Show custom node info" filter: > > ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) > > ### Testing > > #### Functionality > > - Tested the functionality manually on a small selection of graphs. > > - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > #### Performance > > Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. > > The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). 
This pull request has now been integrated. Changeset: 345669c2 Author: Roberto Casta?eda Lozano URL: https://git.openjdk.org/jdk/commit/345669c29d422e4dfd5ff3d1132023ebc02f1bcd Stats: 877 lines in 37 files changed: 537 ins; 207 del; 133 mod 8302738: IGV: refine 'Simplify graph' filter Reviewed-by: tholenstein, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/12955 From tholenstein at openjdk.org Fri Mar 31 12:13:29 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 31 Mar 2023 12:13:29 GMT Subject: RFR: JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name [v2] In-Reply-To: <34xb3nGAybhQeR36tczM4JaevFVY8SrK_Ab7tXJGgJQ=.ef00e088-7e2a-4527-b031-938141fbaef7@github.com> References: <34xb3nGAybhQeR36tczM4JaevFVY8SrK_Ab7tXJGgJQ=.ef00e088-7e2a-4527-b031-938141fbaef7@github.com> Message-ID: On Thu, 30 Mar 2023 08:58:39 GMT, Tobias Hartmann wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> Update Group.java >> >> update copyright year > > Looks good to me. thanks @TobiHartmann , @robcasloz and @theRealELiu for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13241#issuecomment-1491830883 From tholenstein at openjdk.org Fri Mar 31 12:13:30 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 31 Mar 2023 12:13:30 GMT Subject: Integrated: JDK-8305223: IGV: mark osr compiled graphs with [OSR] in the name In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 07:46:43 GMT, Tobias Holenstein wrote: > Graphs in IGV that were osr compiled have an "osr" property. To make it easier to distinguish them from non-osr, append `[OSR]` to the name of the group in the outline > > osr This pull request has now been integrated. Changeset: 049b953f Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/049b953f8fdab62532e957c86a6009f4c8fa1653 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod 8305223: IGV: mark osr compiled graphs with [OSR] in the name Reviewed-by: thartmann, rcastanedalo, eliu ------------- PR: https://git.openjdk.org/jdk/pull/13241 From qamai at openjdk.org Fri Mar 31 12:25:16 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 31 Mar 2023 12:25:16 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: > Hi, > > This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: > > 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. > 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. > 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. > 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. 
> > Upon these changes, a `rearrange` can emit more efficient code: > > var species = IntVector.SPECIES_128; > var v1 = IntVector.fromArray(species, SRC1, 0); > var v2 = IntVector.fromArray(species, SRC2, 0); > v1.rearrange(v2.toShuffle()).intoArray(DST, 0); > > Before: > movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} > vmovdqu 0x10(%r10),%xmm2 > movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} > vmovdqu 0x10(%r10),%xmm0 > vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask > ; {external_word} > vpackusdw %xmm0,%xmm0,%xmm0 > vpackuswb %xmm0,%xmm0,%xmm0 > vpmovsxbd %xmm0,%xmm3 > vpcmpgtd %xmm3,%xmm1,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fc2acb4e0d8 > vpmovzxbd %xmm0,%xmm0 > vpermd %ymm2,%ymm0,%ymm0 > movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} > vmovdqu %xmm0,0x10(%r10) > > After: > movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} > vmovdqu 0x10(%r10),%xmm2 > vpxor %xmm0,%xmm0,%xmm0 > vpcmpgtd %xmm2,%xmm0,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fa818b27cb1 > vpermd %ymm1,%ymm2,%ymm0 > movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} > vmovdqu %xmm0,0x10(%r10) > > Please take a look and leave reviews. Thanks a lot. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: small cosmetics ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13093/files - new: https://git.openjdk.org/jdk/pull/13093/files/a4835c00..97c8fabf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13093&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13093&range=04-05 Stats: 6 lines in 4 files changed: 1 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/13093.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13093/head:pull/13093 PR: https://git.openjdk.org/jdk/pull/13093 From qamai at openjdk.org Fri Mar 31 12:25:24 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 31 Mar 2023 12:25:24 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 00:18:21 GMT, Paul Sandoz wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: >> >> - move implementations up >> - Merge branch 'master' into shufflerefactor >> - Merge branch 'master' into shufflerefactor >> - reviews >> - missing casts >> - clean up >> - fix Matcher::vector_needs_load_shuffle >> - fix internal types, clean up >> - optimise laneIsValid >> - Merge branch 'master' into shufflerefactor >> - ... and 4 more: https://git.openjdk.org/jdk/compare/d063b896...a4835c00 > > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template line 1106: > >> 1104: @Override >> 1105: @ForceInline >> 1106: public int laneSource(int i) { > > Can this method be moved to `AbstractShuffle`? No because `T lane(int)` is a method of the typed vector classes which is not available in `AbstractVector` > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template line 1158: > >> 1156: } >> 1157: >> 1158: private static $bitstype$[] prepare(int[] indices, int offset) { > > If we want to reduce code duplication further I suspect we could move these static methods to IntVector etc. Up to you. 
I think duplication of generated code is less of a concern so it may be more desirable to keep them in the shuffle classes and near their usages. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1154405710 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1154407372 From qamai at openjdk.org Fri Mar 31 12:25:18 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 31 Mar 2023 12:25:18 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v2] In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 16:29:44 GMT, Paul Sandoz wrote: >> I have moved most of the methods to `AbstractVector` and `AbstractShuffle`, I have to resort to raw types, though, since there seems to be no way to do the same with wild cards, and the generics mechanism is not powerful enough for things like `Vector`. The remaining failure seems to be related to [JDK-8304676](https://bugs.openjdk.org/projects/JDK/issues/JDK-8304676), so I think this patch is ready for review now. >> >>> The mask implementation is specialized by the species of vectors it operates on, but does it have to be >> >> Apart from the mask implementation, shuffle implementation definitely has to take into consideration the element type. However, this information does not have to be visible to the API, similar to how we currently handle the vector length, we can have `class AbstractMask implements VectorMask`. As a result, the cast method would be useless and can be removed in the API, but our implementation details would still use it, for example >> >> Vector blend(Vector v, VectorMask w) { >> AbstractMask aw = (AbstractMask) w; >> AbstractMask tw = aw.cast(vspecies()); >> return VectorSupport.blend(...); >> } >> >> Vector rearrange(VectorShuffle s) { >> AbstractShuffle as = (AbstractShuffle) s; >> AbstractShuffle ts = s.cast(vspecies()); >> return VectorSupport.rearrangeOp(...); >> } >> >> What do you think? > >> Apart from the mask implementation, shuffle implementation definitely has to take into consideration the element type. > > Yes, the way you have implemented shuffle is tightly connected, that looks ok. > > I am wondering if we can make the mask implementation more loosely coupled and modified such that it does not have to take into consideration the element type (or species) of the vector it operates on, and instead compatibility is based solely on the lane count. > > Ideally it would be good to change the `VectorMask::check` method to just compare the lanes counts and not require a cast in the implementation, which i presume requires some deeper changes in C2? > > What you propose seems a possible a interim step towards a more preferable API, if the performance is good. Thanks @PaulSandoz and @XiaohongGong for the reviews and testings. 
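To make point 1 of the quoted description concrete, a minimal sketch of the "dynamic indices" case the refactor targets is shown below; the method and array names are assumptions for illustration, not code from the PR or its tests.

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorShuffle;
import jdk.incubator.vector.VectorSpecies;

// A rearrange whose shuffle indices are loaded from memory at runtime:
// the case where the cost of converting between a shuffle and a vector
// shows up, since the indices are not compile-time constants.
public class DynamicShuffleSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    static void permute(int[] src, int[] indices, int[] dst) {
        for (int i = 0; i < SPECIES.loopBound(src.length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, src, i);
            VectorShuffle<Integer> s = VectorShuffle.fromArray(SPECIES, indices, i);
            v.rearrange(s).intoArray(dst, i);
        }
    }
}

With the shuffle stored as a vector of the element's bit type, the index load and range check in a loop like this no longer need the byte-array expansion described in points 1 and 4 of the quoted description.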
------------- PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1491844652 From stuefe at openjdk.org Fri Mar 31 13:10:18 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Fri, 31 Mar 2023 13:10:18 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v3] In-Reply-To: <6g6jSo3ngSBhalwL_-2r-9ysfrdDUUZIyzo-Q4tHvcs=.7ba9d6a3-750b-4162-8c79-42d71109f348@github.com> References: <4iSP7cQzd1cs5DRHTEmFDmyLXwCkkmm-O504keQPuZw=.723ab392-e59b-4c49-a43c-9b4690298de5@github.com> <6g6jSo3ngSBhalwL_-2r-9ysfrdDUUZIyzo-Q4tHvcs=.7ba9d6a3-750b-4162-8c79-42d71109f348@github.com> Message-ID: On Thu, 30 Mar 2023 16:49:31 GMT, Vladimir Kempik wrote: > > Sorry if I'm slow, but I do not understand this patch, and the JBS issue text does not help much. Why would the existing test for alignment not be sufficient to prevent unaligned access? > > Hello > > Currently emit_intXX perform unaligned access without any check for existing flags regarding the status of unaligned access. This especially bad(performance wise) on platforms where misalignment access is emulated (e.g. some risc-v boards). > > The idea of the patch is to use put_native_uX instead of direct pointer deref. > > But then I need to adjust some existing put_native_uX code to respect UseUnalignedAccess flags. > > As for speed of new put_native after adding flag check - that flag is kind of constant during the lifetime of jvm and should be easy for any branch predictor, resulting in a pretty low overhead. @VladimirKempik Thank you for the clarification! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13227#issuecomment-1491900412 From rcastanedalo at openjdk.org Fri Mar 31 14:08:42 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 31 Mar 2023 14:08:42 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v5] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: <0Sri_uicDEjV6e7Q9IDqsT1TtJ6ajwsK0DU6zPB_4p4=.6b5d659c-97a7-420a-aa2b-48945c7845cc@github.com> On Thu, 30 Mar 2023 16:30:35 GMT, Roberto Casta?eda Lozano wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information. >> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). 
>> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information; and defines and documents JavaScript helpers to simplify the new and existing available filters. Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. >> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. >> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 53 commits: > > - Merge branch 'master' into JDK-8302738 > - Partially revert "Split simplify graph filter into two, ensure they are applied in right order" > > This reverts parts of commit 07621a8012c925eaa612ef8b35611557f9f0f4ca, as > they have been contributed independently to mainline. > - Fix comment typo > - Revert "Select slots as well" > > This reverts commit 8256f0c20d7747cda691291c47841a9280d8c493. > > Revert "Fix figure selection" > > This reverts commit 71e73e89facfbb31614e4f1f3676c9e91a38e01a. > > Revert "Make slots searchable and selectable" > > This reverts commit 69cbec1f24ec5e941a5a72ab94b79117551d9560. 
> - Increase the bold text line factor slightly > - Add extra horizontal margin for long labels and let them overflow within the node > - Select slots as well > - Remove code that is commented out > - Assert inputLabel is non-null > - Document filter helpers > - ... and 43 more: https://git.openjdk.org/jdk/compare/9df20600...6e6a8ecc Here are the promised follow-up issues: - [JDK-8305386](https://bugs.openjdk.org/browse/JDK-8305386): IGV: slot nodes cannot be found, selected, and highlighted - [JDK-8305389](https://bugs.openjdk.org/browse/JDK-8305389): IGV: add custom info for loop nodes ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1491979435 From chagedorn at openjdk.org Fri Mar 31 14:28:36 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 31 Mar 2023 14:28:36 GMT Subject: RFR: 8302738: IGV: refine 'Simplify graph' filter [v5] In-Reply-To: References: <-D9kP0Lbt1V57jTzJWElLBm9Ao2C_IfP3opQVU1KIcA=.cd4c0e24-dcb8-42cb-a91c-4e9ab71c0ca3@github.com> Message-ID: On Thu, 30 Mar 2023 16:30:35 GMT, Roberto Casta?eda Lozano wrote: >> The "Simplify graph" filter abstracts away details from the graph that are typically unnecessary for debugging or analyzing the represented program. This changeset decouples this filter into two: >> >> - "Simplify graph", which hides elements that are typically (but not always) unnecessary, and >> - "Condense graph", which makes the graph more compact without loss of information. >> >> Together, these two filters reduce the average graph size by a factor of 1.6x (nodes) and 1.9x (edges): >> >> ![without-with-filters](https://user-images.githubusercontent.com/8792647/224118397-e6bd45d1-0b90-4d94-88ae-0a83f9ef20da.png) >> >> Besides decoupling the "Simplify graph" filter, the changeset extends its functionality by: >> - combining Bool and conversion nodes into their predecessors, >> - inlining all Parm nodes except control into their successors (this removes lots of long edges), >> - removing "top" inputs from call-like nodes, >> - inlining more source nodes (such as MachTemp and ThreadLocal) into their successors, >> - pretty-printing the labels of many inlined and combined nodes such as Bool comparisons or Catch projections (via a new filter that edits node properties), and >> - using a sparse representation of nodes with empty inputs (e.g. call-like nodes after applying "Simplify graph"). >> >> The sparse input representation shows dots between non-contiguous inputs, instead of horizontal space proportional to the number of empty inputs. This helps reducing node width, which is known to improve overall layout quality: >> >> ![dense-vs-sparse](https://user-images.githubusercontent.com/8792647/224118703-04f663b7-7a73-4e49-87d9-2acd8b98522b.png) >> >> Note that the exact input indices can still be retrieved via the incoming edge's tooltips: >> >> ![tooltip-with-input-index](https://user-images.githubusercontent.com/8792647/224119319-7f40fba2-1e9f-436e-a11c-8c3d428d46a6.png) >> >> The control-flow graph view is also adapted to this representation: >> >> ![sparse-in-cfg](https://user-images.githubusercontent.com/8792647/224119399-884e2516-a9a1-43fd-b5f5-747c99472ace.png) >> >> #### Additional improvements >> >> Additionally, this changeset introduces a complementary filter "Show custom node info" (enabled by default) that extends the labels of call and exception-creation nodes with custom information; and defines and documents JavaScript helpers to simplify the new and existing available filters. 
Here is an example of the effect of the new "Show custom node info" filter: >> >> ![show-custom-node-info](https://user-images.githubusercontent.com/8792647/224119545-fd564224-7ccc-4829-988e-77f05d25b3bc.png) >> >> ### Testing >> >> #### Functionality >> >> - Tested the functionality manually on a small selection of graphs. >> >> - Tested automatically that viewing thousands of graphs in the three views with different filter subsets enabled does not trigger any assertion failure (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). >> >> #### Performance >> >> Measured the combined filter application and view creation time for the sea-of-nodes view on a selection of 100 medium-sized graphs (200-500 nodes). On average, applying the new "Show custom node info" filter introduces a minimal overhead of around 1%, which motivates enabling it by default. Applying the "simplify graph" and "condense graph" on top actually gives a speedup of about 12%, since the additional filter application time is amortized by laying out and drawing fewer nodes. However, these filters are not enabled by default, since they cause a (minor) loss of information which is not desirable in every use case. >> >> The graph size reduction and performance results are [attached](https://github.com/openjdk/jdk/files/10934804/performance-evaluation.ods) (note that each time measurement in the sheet corresponds to the median of ten runs). > > Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 53 commits: > > - Merge branch 'master' into JDK-8302738 > - Partially revert "Split simplify graph filter into two, ensure they are applied in right order" > > This reverts parts of commit 07621a8012c925eaa612ef8b35611557f9f0f4ca, as > they have been contributed independently to mainline. > - Fix comment typo > - Revert "Select slots as well" > > This reverts commit 8256f0c20d7747cda691291c47841a9280d8c493. > > Revert "Fix figure selection" > > This reverts commit 71e73e89facfbb31614e4f1f3676c9e91a38e01a. > > Revert "Make slots searchable and selectable" > > This reverts commit 69cbec1f24ec5e941a5a72ab94b79117551d9560. > - Increase the bold text line factor slightly > - Add extra horizontal margin for long labels and let them overflow within the node > - Select slots as well > - Remove code that is commented out > - Assert inputLabel is non-null > - Document filter helpers > - ... and 43 more: https://git.openjdk.org/jdk/compare/9df20600...6e6a8ecc Great, thanks Roberto! ------------- PR Comment: https://git.openjdk.org/jdk/pull/12955#issuecomment-1492006195 From jcking at openjdk.org Fri Mar 31 14:30:29 2023 From: jcking at openjdk.org (Justin King) Date: Fri, 31 Mar 2023 14:30:29 GMT Subject: RFR: JDK-8304684: Memory leak in DirectivesParser::set_option_flag [v4] In-Reply-To: <9XO5we9RK8MKNE5HpGWLFySNOr6Y_TB6gXl13ksg0Yo=.dec7763e-9483-4c8c-ba79-7b6d47148d81@github.com> References: <9XO5we9RK8MKNE5HpGWLFySNOr6Y_TB6gXl13ksg0Yo=.dec7763e-9483-4c8c-ba79-7b6d47148d81@github.com> Message-ID: On Tue, 28 Mar 2023 14:30:55 GMT, Justin King wrote: >> Update `DirectivesSet` to take ownership of string options in some cases, to not leak memory. > > Justin King has updated the pull request incrementally with one additional commit since the last revision: > > Adjust logic based on review > > Signed-off-by: Justin King Going to run this through ASan/LSan to double check. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13125#issuecomment-1492010882 From psandoz at openjdk.org Fri Mar 31 16:24:25 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Fri, 31 Mar 2023 16:24:25 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v5] In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:17:49 GMT, Quan Anh Mai wrote: >> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template line 1106: >> >>> 1104: @Override >>> 1105: @ForceInline >>> 1106: public int laneSource(int i) { >> >> Can this method be moved to `AbstractShuffle`? > > No because `T lane(int)` is a method of the typed vector classes which is not available in `AbstractVector` Ah, doh!, yes of course, we would need to specialize the shuffle. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1154671839 From amitkumar at openjdk.org Fri Mar 31 16:43:18 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 31 Mar 2023 16:43:18 GMT Subject: RFR: 8305227: [s390x] build broken after JDK-8231349 In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 16:38:47 GMT, Vladimir Kozlov wrote: >> This PR moves nmethod entry barrier from `generate_compiler_stubs()` to `generate_final_stubs()`. Test build for fastdebug, slow debug, release and optimised. Tier1 test in fastdebug seems clean as well. > > Good. Thanks @vnkozlov ------------- PR Comment: https://git.openjdk.org/jdk/pull/13259#issuecomment-1492248943 From kvn at openjdk.org Fri Mar 31 16:43:18 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 31 Mar 2023 16:43:18 GMT Subject: RFR: 8305227: [s390x] build broken after JDK-8231349 In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 07:06:25 GMT, Amit Kumar wrote: > This PR moves nmethod entry barrier from `generate_compiler_stubs()` to `generate_final_stubs()`. Test build for fastdebug, slow debug, release and optimised. Tier1 test in fastdebug seems clean as well. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13259#pullrequestreview-1367266112 From amitkumar at openjdk.org Fri Mar 31 17:01:28 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 31 Mar 2023 17:01:28 GMT Subject: Integrated: 8305227: [s390x] build broken after JDK-8231349 In-Reply-To: References: Message-ID: <_8Wtjy8I3B6c3cIwm5fQOmkHLB26WpJSLPT1LEMdyrU=.56bf9572-83ae-4e63-a7cb-463b52ac1af9@github.com> On Fri, 31 Mar 2023 07:06:25 GMT, Amit Kumar wrote: > This PR moves nmethod entry barrier from `generate_compiler_stubs()` to `generate_final_stubs()`. Test build for fastdebug, slow debug, release and optimised. Tier1 test in fastdebug seems clean as well. This pull request has now been integrated. 
Changeset: 4a5d7ca7 Author: Amit Kumar Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/4a5d7ca7d9cf90f8c61d890419c8557b61f78f7e Stats: 13 lines in 1 file changed: 7 ins; 6 del; 0 mod 8305227: [s390x] build broken after JDK-8231349 Reviewed-by: shade, lucy, kvn ------------- PR: https://git.openjdk.org/jdk/pull/13259 From psandoz at openjdk.org Fri Mar 31 17:07:20 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Fri, 31 Mar 2023 17:07:20 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v2] In-Reply-To: References: Message-ID: On Tue, 21 Mar 2023 16:29:44 GMT, Paul Sandoz wrote: >> I have moved most of the methods to `AbstractVector` and `AbstractShuffle`, I have to resort to raw types, though, since there seems to be no way to do the same with wild cards, and the generics mechanism is not powerful enough for things like `Vector`. The remaining failure seems to be related to [JDK-8304676](https://bugs.openjdk.org/projects/JDK/issues/JDK-8304676), so I think this patch is ready for review now. >> >>> The mask implementation is specialized by the species of vectors it operates on, but does it have to be >> >> Apart from the mask implementation, shuffle implementation definitely has to take into consideration the element type. However, this information does not have to be visible to the API, similar to how we currently handle the vector length, we can have `class AbstractMask implements VectorMask`. As a result, the cast method would be useless and can be removed in the API, but our implementation details would still use it, for example >> >> Vector blend(Vector v, VectorMask w) { >> AbstractMask aw = (AbstractMask) w; >> AbstractMask tw = aw.cast(vspecies()); >> return VectorSupport.blend(...); >> } >> >> Vector rearrange(VectorShuffle s) { >> AbstractShuffle as = (AbstractShuffle) s; >> AbstractShuffle ts = s.cast(vspecies()); >> return VectorSupport.rearrangeOp(...); >> } >> >> What do you think? > >> Apart from the mask implementation, shuffle implementation definitely has to take into consideration the element type. > > Yes, the way you have implemented shuffle is tightly connected, that looks ok. > > I am wondering if we can make the mask implementation more loosely coupled and modified such that it does not have to take into consideration the element type (or species) of the vector it operates on, and instead compatibility is based solely on the lane count. > > Ideally it would be good to change the `VectorMask::check` method to just compare the lanes counts and not require a cast in the implementation, which i presume requires some deeper changes in C2? > > What you propose seems a possible a interim step towards a more preferable API, if the performance is good. > Thanks @PaulSandoz and @XiaohongGong for the reviews and testings. Running tier2/3 tests. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1492274813 From xliu at openjdk.org Fri Mar 31 17:53:20 2023 From: xliu at openjdk.org (Xin Liu) Date: Fri, 31 Mar 2023 17:53:20 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Thu, 30 Mar 2023 23:36:20 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. 
>> The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested.
>>
>> With what frequency does each IR node type occur as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N:
>>
>> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png)
>>
>> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1:
>>
>> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png)
>>
>> This PR adds support for scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list).
>>
>> The approach I used for _rematerialization_ is pretty straightforward. It basically consists of: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new class to support rematerialization of SR objects that are part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges.
>>
>> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge, which will render the Phi useless.
>>
>> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regressions. I also tested with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures.
>
> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision:
>
>   Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges.

src/hotspot/share/opto/escape.cpp line 588:

> 586: }
> 587:
> 588: // This method will create a SafePointScalarObjectNode for each combination of

You have changed this to `SafePointScalarMergeNode` in the code.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1154745690

From xliu at openjdk.org Fri Mar 31 18:24:18 2023
From: xliu at openjdk.org (Xin Liu)
Date: Fri, 31 Mar 2023 18:24:18 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5]
In-Reply-To:
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
Message-ID:

On Thu, 30 Mar 2023 23:36:20 GMT, Cesar Soares Lucas wrote:

>> Can I please get reviews for this PR?
>>
>> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested.
>>
>> With what frequency does each IR node type occur as an allocation merge user?
>> I.e., if the same node type uses a Phi N times the counter is incremented by N:
>>
>> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png)
>>
>> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1:
>>
>> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png)
>>
>> This PR adds support for scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list).
>>
>> The approach I used for _rematerialization_ is pretty straightforward. It basically consists of: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new class to support rematerialization of SR objects that are part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges.
>>
>> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge, which will render the Phi useless.
>>
>> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regressions. I also tested with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures.
>
> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision:
>
>   Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges.

src/hotspot/share/opto/macro.cpp line 632:

> 630: safepoints->append_if_missing(sfpt);
> 631: }
> 632: } else if (ignore_merges && (use->is_Phi() || use->is_EncodeP() || use->Opcode() == Op_MemBarRelease)) {

I'm trying to understand this part. Now `can_eliminate_allocation` can pre-test whether SR can eliminate the allocation. I see that you use it in EA.

With `ignore_merges`, why do we also skip EncodeP or MemBarRelease here?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1154771471

From xliu at openjdk.org Fri Mar 31 18:27:21 2023
From: xliu at openjdk.org (Xin Liu)
Date: Fri, 31 Mar 2023 18:27:21 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5]
In-Reply-To:
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
Message-ID:

On Fri, 31 Mar 2023 18:21:40 GMT, Xin Liu wrote:

>> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges.
>
> src/hotspot/share/opto/macro.cpp line 632:
>
>> 630: safepoints->append_if_missing(sfpt);
>> 631: }
>> 632: } else if (ignore_merges && (use->is_Phi() || use->is_EncodeP() || use->Opcode() == Op_MemBarRelease)) {
>
> I'm trying to understand this part.
> Now `can_eliminate_allocation` can pre-test whether SR can eliminate the allocation. I see that you use it in EA.
>
> With `ignore_merges`, why do we also skip EncodeP or MemBarRelease here?

Do you really need the boolean parameter `ignore_merges` here? It looks like we could use `(safepoints == nullptr)` instead?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1154773826

From xliu at openjdk.org Fri Mar 31 18:33:19 2023
From: xliu at openjdk.org (Xin Liu)
Date: Fri, 31 Mar 2023 18:33:19 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5]
In-Reply-To:
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
Message-ID:

On Thu, 30 Mar 2023 23:36:20 GMT, Cesar Soares Lucas wrote:

>> Can I please get reviews for this PR?
>>
>> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested.
>>
>> With what frequency does each IR node type occur as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N:
>>
>> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png)
>>
>> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1:
>>
>> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png)
>>
>> This PR adds support for scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list).
>>
>> The approach I used for _rematerialization_ is pretty straightforward. It basically consists of: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new class to support rematerialization of SR objects that are part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges.
>>
>> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge, which will render the Phi useless.
>>
>> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regressions. I also tested with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures.
>
> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision:
>
>   Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges.

src/hotspot/share/opto/escape.cpp line 457:

> 455: found_sr_allocate = true;
> 456: } else {
> 457: ptn->set_scalar_replaceable(false);

This member function is const. Do we really need to change ptn's property here?
My reading is that `ophi` is profitable as long as we spot any input object which can be eliminated. How about just returning at line 455?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1154778280

From xliu at openjdk.org Fri Mar 31 18:41:23 2023
From: xliu at openjdk.org (Xin Liu)
Date: Fri, 31 Mar 2023 18:41:23 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4]
In-Reply-To:
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> <0UbMqMHtVIayPdJMmfDF6YTadWe4YTlSW6mZc5P3IU8=.c4b1a292-e434-4c57-a5cd-015edca2ec95@github.com>
Message-ID:

On Fri, 24 Mar 2023 23:37:29 GMT, Vladimir Kozlov wrote:

>> I had to make this method static because it uses `value_from_mem` - which I also made static. I had to make `value_from_mem` static so that I can use it outside PhaseMacroExpand.
>
> I see, you use it in escape.cpp. Okay. I need to review changes there too.

Or you could construct a temporary PhaseMacroExpand object in EA. I see that you converted many member functions to static so you can query them in EA; the only blocker is `_igvn`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1154785040

From jlu at openjdk.org Fri Mar 31 21:41:17 2023
From: jlu at openjdk.org (Justin Lu)
Date: Fri, 31 Mar 2023 21:41:17 GMT
Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v5]
In-Reply-To:
References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com>
Message-ID:

On Fri, 17 Mar 2023 22:27:48 GMT, Justin Lu wrote:

>> This PR converts Unicode sequences to UTF-8 native in .properties files (excluding the Unicode space and tab sequences). The conversion was done using native2ascii.
>>
>> In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file.
>
> Justin Lu has updated the pull request incrementally with one additional commit since the last revision:
>
>   Close streams when finished loading into props

Something to consider is that IntelliJ defaults .properties files to ISO 8859-1. https://www.jetbrains.com/help/idea/properties-files.html#encoding

So users of IntelliJ (or other IDEs that default to ISO 8859-1 for .properties files) will need to change the default encoding to UTF-8 for such files. Or ideally, the respective IDEs can change their default encoding for .properties files if this change is integrated.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/12726#issuecomment-1492640306

From naoto at openjdk.org Fri Mar 31 22:48:29 2023
From: naoto at openjdk.org (Naoto Sato)
Date: Fri, 31 Mar 2023 22:48:29 GMT
Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v5]
In-Reply-To:
References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com>
Message-ID:

On Fri, 17 Mar 2023 22:27:48 GMT, Justin Lu wrote:

>> This PR converts Unicode sequences to UTF-8 native in .properties files (excluding the Unicode space and tab sequences). The conversion was done using native2ascii.
>>
>> In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file.
>
> Justin Lu has updated the pull request incrementally with one additional commit since the last revision:
>
>   Close streams when finished loading into props

Hmm, I just wonder why they are sticking to ISO-8859-1 as the default. I know j.u.Properties defaults to 8859-1, but PropertyResourceBundle, which is their primary use, defaults to UTF-8 since JDK 9 (https://openjdk.org/jeps/226).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/12726#issuecomment-1492682703
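For reference, a minimal sketch (not taken from the PR under review) of the encoding split discussed in the last two messages: `java.util.Properties.load(InputStream)` still decodes ISO 8859-1, `Properties.load(Reader)` uses whatever charset the caller supplies, and `PropertyResourceBundle` decodes UTF-8 by default since JDK 9 (JEP 226). The file name `messages.properties` and the key `greeting` below are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.PropertyResourceBundle;

public class PropertiesEncodingSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical file containing non-ASCII values saved as UTF-8.
        String file = "messages.properties";

        // Properties.load(InputStream) decodes the bytes as ISO 8859-1,
        // so raw UTF-8 text is garbled unless it was \uXXXX-escaped first.
        Properties latin1 = new Properties();
        try (InputStream in = new FileInputStream(file)) {
            latin1.load(in);
        }

        // Properties.load(Reader) uses the charset the Reader was constructed with.
        Properties utf8 = new Properties();
        try (Reader r = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
            utf8.load(r);
        }

        // PropertyResourceBundle(InputStream) decodes UTF-8 by default since JDK 9
        // (JEP 226), falling back to ISO 8859-1 if the bytes are not valid UTF-8.
        try (InputStream in = new FileInputStream(file)) {
            PropertyResourceBundle bundle = new PropertyResourceBundle(in);
            System.out.println(bundle.getString("greeting")); // hypothetical key
        }

        System.out.println(latin1.getProperty("greeting"));
        System.out.println(utf8.getProperty("greeting"));
    }
}
```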