From haosun at openjdk.org Tue Nov 1 01:39:29 2022 From: haosun at openjdk.org (Hao Sun) Date: Tue, 1 Nov 2022 01:39:29 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: <2fWOyn2ZhGy2Gw9fCy5P2wQCOY-wSBwWBc38oykTasU=.8e61e0ba-9aa7-4196-a695-76779edf6bc7@github.com> On Mon, 31 Oct 2022 15:08:33 GMT, Stuart Monteith wrote: >> The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. >> >> Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. >> >> Running on an SVE2 enabled system, I ran the following benchmarks: >> >> org.openjdk.bench.java.lang.Integers >> org.openjdk.bench.java.lang.Longs >> >> The time for each operation reduced to 56% to 72% of the original run time: >> >> >> Benchmark Result error Unit % against non-SVE2 >> Integers.expand 2.106 0.011 us/op >> Integers.expand-SVE 1.431 0.009 us/op 67.95% >> Longs.expand 2.606 0.006 us/op >> Longs.expand-SVE 1.46 0.003 us/op 56.02% >> Integers.compress 1.982 0.004 us/op >> Integers.compress-SVE 1.427 0.003 us/op 72.00% >> Longs.compress 2.501 0.002 us/op >> Longs.compress-SVE 1.441 0.003 us/op 57.62% >> >> >> These methods can bed specifically tested with: >> `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` > > Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/cpu/aarch64/aarch64.ad > > Correct slight formatting error. > > Co-authored-by: Eric Liu LGTM except two style issues. (I'm not a Reviewer) src/hotspot/cpu/aarch64/aarch64.ad line 16978: > 16976: __ sve_bext($tdst$$FloatRegister, __ D, $tsrc$$FloatRegister, $tmask$$FloatRegister); > 16977: __ mov($dst$$Register, $tdst$$FloatRegister, __ D, 0); > 16978: %} nit: indentation issue Suggestion: %} src/hotspot/cpu/aarch64/aarch64.ad line 16996: > 16994: __ sve_bdep($tdst$$FloatRegister, __ S, $tsrc$$FloatRegister, $tmask$$FloatRegister); > 16995: __ mov($dst$$Register, $tdst$$FloatRegister, __ S, 0); > 16996: %} ditto ------------- Marked as reviewed by haosun (Author). PR: https://git.openjdk.org/jdk/pull/10537 From xxinliu at amazon.com Tue Nov 1 04:34:42 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Mon, 31 Oct 2022 21:34:42 -0700 Subject: RFC: Partial Escape Analysis in HotSpot C2 In-Reply-To: <4d996adb-7d10-aa02-3a47-73d043b5013d@amazon.com> References: <8644C1FF-CD78-4D05-9382-00C1E2661EDD@oracle.com> <94A2737B-0FDE-4159-827F-15B6ACE1BC42@oracle.com> <114af950-f6b6-7e4a-8ac0-3da99bd40297@amazon.com> <2f29160c-7368-7c11-924e-a626e42c3aa2@amazon.com> <6d5c2aa5-c684-bc42-765d-ed116d3ef43c@oracle.com> <0bc75ee6-641f-1145-8fde-6d11e2ec887e@amazon.com> <1da07de9-90d2-d4ad-188e-d7d976009f52@oracle.com> <4768851c-2f3b-69be-ce28-070dae4792c7@amazon.com> <4127d57d-ca6f-0cda-13a8-efbdd2ef0501@oracle.com> <4d996adb-7d10-aa02-3a47-73d043b5013d@amazon.com> Message-ID: Hi, I would like to update on this. 
Last week, I posted Example-1, which is a poster-child partial-escape case (https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79). It shows that it is possible to conduct Stadler's PEA in the parser and we can punt the obsolete object to C2 EA/SR. I made some progress on Example-2 (https://gist.github.com/navyxliu/ee4465e2146ef99c5ae1fa1ba6b70e25). It's not a partial-escape case, but the parser still needs to handle it correctly. I think it can explain better why the algorithm can guarantee the 'dynamic invariant'. 27 Allocate dominates 106/129 (clones) like I explained before. A new application of PEA emerges from the prior discussion (JDK-8267532). The Try-Catch idiom divides control-flow into hot-cold paths by nature. Method calls such as "obj.close()" in the cold path may not be inlined. There are 2 possibilities: (1) the callee itself is too big; (2) the inliner is out of budget in favor of the hot paths. Non-inlined callsites will impede EA/SR because the object escapes as a receiver. We found that we can leverage the technique "object cloning" developed in the PEA algorithm to tackle this issue. In essence, PEA would split the original object into 2 copies: one is for the hot path and the other one is for the cold path. This is like what I did in Example-2, but along the exceptional flow. Then C2 would conduct scalar replacement in the hot path because the object in the try-block is non-escaping! Therefore, I would like to treat ResourceScopeCloseMin as 'Example-3' and try it out. We know that there are different solutions for this idiom. The C2 parser could use uncommon_trap for the exceptional paths, or we can adjust the inliner policy for this case. I am not suggesting PEA is the right way to solve it. I just want to explore a possibility. Paul commented that I bloat code by cloning objects in Example-2. I need to develop an optimization to recognize the 'common subexpressions' and eliminate them. I am aware of it and I mentioned it in the Risk section of the RFC. I would like to put it aside because it's an optimization. I think we can revert to the original allocation if the transformation turns out not to be beneficial. I guess we can revert objects after Iterative EA. JDK-8267532 is an example where too-early restoration may miss the chance of EA/SR. thanks, --lx On 10/24/22 1:11 PM, Liu, Xin wrote: > hi, Vladimir Ivanov, > > Your email is my starting point. It's thorough and very insightful. > We spent a lot of time trying to crack your questions. The RFC is the > summary of what we got. I own your a big thank! > > Sorry, I still haven't had a clear answer for "2. Move vs Split > decision" yet. Stadler's algorithm just materialize a virtual object on > demand. After then, its state changes from virtual to materialized. > IMHO, I don't think it is optimal placement. I feel the optimal > placement should be along domination frontier of the original > AllocateNode and Exit. It is like the minimal phi construction and we > might borrow the idea from it. I put aside it because it's indeed an > optimization problem. Maybe it's not a big deal in common cases. > > > I try to answer your questions inline. > > On 10/20/22 5:26 PM, Vladimir Ivanov wrote: >> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >> >> >> >> Hi, >> >>> I would like to update on this. I manage to get PEA work in Vladimir >>> Ivanov's testcase. I put the testcase, assembly and graphs here[1].
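To make the try/catch idea above more concrete, here is a rough Java-level sketch of the shape in question (class and method names are made up for illustration; this is not code from the RFC, and the comments only describe the intended transformation, not an actual C2 output):

    class CloseIdiom {
        Object sink;

        int hot(int x) {
            Resource r = new Resource(x);   // candidate for scalar replacement
            try {
                return r.value + x;         // hot path: r does not escape
            } catch (RuntimeException e) {
                coldClose(r);               // cold path: r escapes into a non-inlined call
                throw e;
            }
        }

        // Conceptually, PEA would clone the allocation: the hot path keeps a
        // non-escaping copy that C2 EA/SR can scalar-replace, and only the cold
        // path materializes a real object right before the escaping call.
        void coldClose(Resource r) { sink = r; }   // stand-in for a too-big obj.close()

        static class Resource {
            final int value;
            Resource(int value) { this.value = value; }
        }
    }

The only point is the control-flow shape: the escape sits behind the call on the exceptional path, so after splitting, the hot-path copy is invisible to that call.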
>>> >>> Even though it is a quite simple case, I think it demonstrates that the >>> RFC is practical in C2. I proposed 3 major differences from Graal. >> >> Nice! Also, a very similar (but a much more popular case) should be >> escape sites in catch blocks (as reported by JDK-8267532 [1]). >> >>> 1. The algorithm runs in parser instead of optimizer. >>> 2. Prefer clone-and-eliminate strategy rather than >>> virtualize-and-materialize. >>> 3. Refrain from scalar replacement on-the-fly. >> >> I don't understand how you plan to implement it solely during parsing. >> You could do some bookkeeping during parsing and capture JVM state, but >> I don't see how to do EA that early. >> >> Also, please, elaborate on #3. It's not clear to me what do you mean there. >> > I added a PEAState for each basic block[2]. > > If we need to materialize a virtual object, we just create a new > AllocateNode and its children in-place[3]. > > > In Stadler's algorithm, it inherently performs "scalar replacement". Its > first step is to delete the original allocation node. PEA phase replaces > LoadField nodes with scalars. Because I propose to focus on 'escaping > object' only, we don't perform 'scalar replacement'. I leave them to the > C2 EA/SR. that's why I say "I refrain from on-the-fly scalar replacement". > > >>> The test excises them all. I pasted 3 graphs here[2]. When we >>> materialize an object, we just clone it with the right JVMState. It >>> shows that C2 IterEA can automatically picks up the obsolete object and >>> get rid of it, as we expected. >>> >>> It turns out cloning an object isn't as complex as I thought. I mainly >>> spent time on adjusting JVMState for the cloned AllocateNode. Not only >>> to call sync_jvm(), I also need to 1) kill dead locals 2) clean stack >>> and even avoid reexecution that bci. >>> >>> JVMState* jvms = parser->sync_jvms(); >>> SafePointNode* map = jvms->map(); >>> parser->kill_dead_locals(); >>> parser->clean_stack(jvms->sp()); >>> jvms->set_should_reexecute(false); >>> >>> Clearly, the algorithm hasn't completed yet. I am still working on >>> MergeProcessor, general classes fields and loop construct. >> >> There was a previous discussion on PEA for C2 back in 2021 [2] [3]. One >> interesting observation related to your current experiments was: >> >> "4. Escape sites separate the graph into 2 parts: before and after the >> instance escapes. In order to preserve identity invariants (and avoid >> identity paradoxes), PEA can't just put an allocation at every escape >> site. It should respect the order of escape events and ensure that the >> very same object is observed when multiple escape events happen. >> >> Dynamic invariant can be formulated as: there should never be more than >> 1 allocation at runtime per 1 eliminated allocation. >> >> Considering non-escaping operations can force materialization on their >> own, it poses additional constraints." >> >> So, when you clone an allocation, you should ensure that only a single >> instance can be observed. And safepoints can be escape points as well >> (rematerialization in case of deoptimization event). >> > This is fairly complex. That's why I suggest to focus on 'escaping > objects' only in C2 PEA. Please assume that all non-escape objects > remain intact after our PEA. > > I think we can guarantee dynamic invariant here. First of all, we > traverse basic blocks in reverse-post-order(RPO). It's in the same > direction of execution. An object allocation is virtual initially. 
Once > a virtual object becomes materialized, it won't change back. We track > allocation states so we know that. The following appearances of an > materialized object won't cause 'materialization' again. It will be > treated as a ordinary object. > > Here is Example2 and I am still working on it. Beside the place where x > escapes to _cache, we also need to materialize the virtual object before > merging two basic blocks. this is described in 5.3 Merge nodes of > Stadler's CGO paper. > > > class Example2 { > private Object _cache; > public Object foo(boolean cond) { > Object x = new Object(); > > blackhole(); > > if (cond) { > _cache = x; > } > return x; > } > > public static void blackhole() {} > ... > } > > We expect to see code as follows after PEA. "x2 = new Object()" is the > result of materialization at merging point. > > public Object foo(boolean cond) { > Object x0 = new Object(); > > blackhole(); > > if (cond) { > x1 = new Object(); > _cache = x1; > } > x3 = phi(x2 = new Object(), x1); > return x3; > } > > We've proved that the obsolete object is either dead or scalar > replaceable after PEA[4]. We expect C2 EA/SR to get rid of x0 down the > road. Please note that x0(the obsolete obj) dominates all clones(x1 and > x2) because we materialize them before merging, we can guarantee dynamic > invariant. > > I plan to mark the original AllocateNode 'obsolete' if PEA does > materialize it. I am going to assert in MacroExpansion that it won't > expand an obsolete object. If it did, it would introduce redundancy and > violate the 'dynamic invariant'. > > > [1]https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047536.html > [2]https://github.com/navyxliu/jdk/blob/PEA_parser/src/hotspot/share/opto/parse.hpp#L171 > > [3] > https://github.com/navyxliu/jdk/blob/PEA_parser/src/hotspot/share/opto/parseHelper.cpp#L367 > > [4]https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-October/059432.html > > >>> I haven't figured out how to test PEA in a reliable way. It is not easy >>> for IR framework to capture node movement. If we measure allocation >>> rate, it will be subject to CPU capability and also the sampling rate. I >>> came up with an idea so-called 'Epsilon-Test'. We create a JVM with >>> EpsilonGC and a fixed Java heap. Because EpsilonGC never replenish the >>> java heap, we can count how many iterations a test can run before OOME. >>> The less allocation made in a method, the more iterations HotSpot can >>> execute the method. This isn't perfect either. I found that hotspot >>> can't guarantee to execute the final-block in this case[3]. So far, I >>> just measure execution time instead. >> >> It sounds more like a job for benchmarks, but focused on measuring >> allocation rate (per iteration). ("-prof gc" mode in JMH terms.) >> >> Personally, I very much liked the IR framework-based approach Cesar used >> in the unit test for allocation merges [4]. Do you see any problems with >> that? >> >> Best regards, >> Vladimir Ivanov >> > > Okay. I will follow this direction. > > thanks, > --lx > >> [1] https://bugs.openjdk.org/browse/JDK-8267532 >> [2] >> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047486.html >> [3] >> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047536.html >> [4] https://github.com/openjdk/jdk/pull/9073 >> >> >>> >>> Appreciate your feedbacks or you spot any redflag. 
>>> >>> [1] https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79 >>> >>> [2] >>> https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79?permalink_comment_id=4341838#gistcomment-4341838 >>> >>> [3]?https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79#file-example1-java-L43 >>> >>> thanks, >>> --lx >>> >>> >>> >>> >>> On 10/12/22 11:17 AM, Vladimir Kozlov wrote: >>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>>> >>>> >>>> >>>> On 10/12/22 7:58 AM, Liu, Xin wrote: >>>>> hi, Vladimir, >>>>>> You should show that your implementation can rematirealize an object >>>>> at any escape site. >>>>> >>>>> My understanding is I suppose to 'materialize' an object at any escape site. >>>> >>>> Words ;^) >>>> >>>> Yes, I mistyped and misspelled. >>>> >>>> Vladimir K >>>> >>>>> >>>>> 'rematerialize' refers to 'create an scalar-replaced object on heap' in >>>>> deoptimization. It's for interpreter as if the object was created in the >>>>> first place. It doesn't apply to an escaped object because it's marked >>>>> 'GlobalEscaped' in C2 EA. >>>>> >>>>> >>>>> Okay. I will try this idea! >>>>> >>>>> thanks, >>>>> --lx >>>>> >>>>> >>>>> >>>>> >>>>> On 10/11/22 3:12 PM, Vladimir Kozlov wrote: >>>>>> Also in your test there should be no merge at safepoint2 because `obj` is "not alive" (not referenced) anymore. -------------- next part -------------- A non-text attachment was scrubbed... Name: OpenPGP_0xB9D934C61E047B0D.asc Type: application/pgp-keys Size: 3675 bytes Desc: OpenPGP public key URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OpenPGP_signature Type: application/pgp-signature Size: 665 bytes Desc: OpenPGP digital signature URL: From fyang at openjdk.org Tue Nov 1 06:56:27 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 1 Nov 2022 06:56:27 GMT Subject: RFR: 8296136: Use correct register in aarch64_enc_fast_unlock() In-Reply-To: References: Message-ID: On Mon, 31 Oct 2022 17:31:31 GMT, Roman Kennke wrote: > In aarch64_enc_fast_unlock() (aarch64.ad) we have this piece of code: > > > __ ldr(tmp, Address(oop, oopDesc::mark_offset_in_bytes())); > __ tbnz(disp_hdr, exact_log2(markWord::monitor_value), object_has_monitor); > > > The tbnz uses the wrong register - it should really use tmp. disp_hdr has been loaded with the displaced header of the stack-lock, which would never have its monitor bits set, thus the branch will always take the slow path. In this common case, it is only a performance nuisance. In the case of !UseHeavyMonitors it is even worse, then disp_hdr will be unitialized, and we are facing a correctness problem. > > As far as I can tell, the problem dates back to when aarch64 C2 parts have been added to OpenJDK. > > Testing: > - [x] tier1 > - [x] tier2 > - [x] tier3 > - [ ] tier4 Could you please also incorporate following fix for RISC-V at the same time? I see it inherits the same similar issue here. This has passed tier1 test on HiFive Unmatched board. Thanks. diff --git a/src/hotspot/cpu/riscv/riscv.ad b/src/hotspot/cpu/riscv/riscv.ad index 75612ef7508..abe0f609a62 100644 --- a/src/hotspot/cpu/riscv/riscv.ad +++ b/src/hotspot/cpu/riscv/riscv.ad @@ -2474,7 +2474,7 @@ encode %{ // Handle existing monitor. 
__ ld(tmp, Address(oop, oopDesc::mark_offset_in_bytes())); - __ andi(t0, disp_hdr, markWord::monitor_value); + __ andi(t0, tmp, markWord::monitor_value); __ bnez(t0, object_has_monitor); if (!UseHeavyMonitors) { ------------- PR: https://git.openjdk.org/jdk/pull/10921 From jiefu at openjdk.org Tue Nov 1 07:28:19 2022 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 1 Nov 2022 07:28:19 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v2] In-Reply-To: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> References: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> Message-ID: On Fri, 28 Oct 2022 07:19:31 GMT, Jie Fu wrote: >> Hi all, >> >> As discussed here https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it would be better to add the vector api tests in GHA. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add jdk_vector_sanity test group > - Merge branch 'master' into JDK-8295970 > - Revert changes in test.yml > - 8295970: Add jdk_vector tests in GHA Hi all, I changed the JBS title as `Add vector api sanity tests in tier1`. And the added `jdk_vector_sanity` tests can be run in less than 2min on my testing box. Any comments? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10879 From aph at openjdk.org Tue Nov 1 08:45:31 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Nov 2022 08:45:31 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Mon, 31 Oct 2022 15:08:33 GMT, Stuart Monteith wrote: >> The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. >> >> Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. >> >> Running on an SVE2 enabled system, I ran the following benchmarks: >> >> org.openjdk.bench.java.lang.Integers >> org.openjdk.bench.java.lang.Longs >> >> The time for each operation reduced to 56% to 72% of the original run time: >> >> >> Benchmark Result error Unit % against non-SVE2 >> Integers.expand 2.106 0.011 us/op >> Integers.expand-SVE 1.431 0.009 us/op 67.95% >> Longs.expand 2.606 0.006 us/op >> Longs.expand-SVE 1.46 0.003 us/op 56.02% >> Integers.compress 1.982 0.004 us/op >> Integers.compress-SVE 1.427 0.003 us/op 72.00% >> Longs.compress 2.501 0.002 us/op >> Longs.compress-SVE 1.441 0.003 us/op 57.62% >> >> >> These methods can bed specifically tested with: >> `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` > > Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/cpu/aarch64/aarch64.ad > > Correct slight formatting error. 
> > Co-authored-by: Eric Liu src/hotspot/cpu/aarch64/aarch64.ad line 16952: > 16950: format %{ "mov $tsrc, $src\n\t" > 16951: "mov $tmask, $mask\n\t" > 16952: "bext $tdst, $tsrc, $tmask\t# parallel bit extract\n\t" We don't need to tell readers what BEXT does.; ------------- PR: https://git.openjdk.org/jdk/pull/10537 From aph at openjdk.org Tue Nov 1 08:50:33 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Nov 2022 08:50:33 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Mon, 31 Oct 2022 15:08:33 GMT, Stuart Monteith wrote: >> The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64, instead the intrinsics can be implemented with vector instructions included in SVE2; expand with BDEP, compress with BEXT. >> >> Only the first lane of each vector will be used, two MOV instructions will move the inputs from GPRs into temporary vector registers, and another to do the reverse for the result. Autovectorization for this functionality is/will be implemented separately. >> >> Running on an SVE2 enabled system, I ran the following benchmarks: >> >> org.openjdk.bench.java.lang.Integers >> org.openjdk.bench.java.lang.Longs >> >> The time for each operation reduced to 56% to 72% of the original run time: >> >> >> Benchmark Result error Unit % against non-SVE2 >> Integers.expand 2.106 0.011 us/op >> Integers.expand-SVE 1.431 0.009 us/op 67.95% >> Longs.expand 2.606 0.006 us/op >> Longs.expand-SVE 1.46 0.003 us/op 56.02% >> Integers.compress 1.982 0.004 us/op >> Integers.compress-SVE 1.427 0.003 us/op 72.00% >> Longs.compress 2.501 0.002 us/op >> Longs.compress-SVE 1.441 0.003 us/op 57.62% >> >> >> These methods can bed specifically tested with: >> `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"` > > Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/cpu/aarch64/aarch64.ad > > Correct slight formatting error. > > Co-authored-by: Eric Liu src/hotspot/cpu/aarch64/aarch64.ad line 2294: > 2292: case Op_CacheWBPostSync: > 2293: if (!VM_Version::supports_data_cache_line_flush()) { > 2294: ret_value = false; Doesn't this stuff belong in aarch64_sve.ad? src/hotspot/cpu/aarch64/aarch64.ad line 2299: > 2297: case Op_ExpandBits: > 2298: case Op_CompressBits: > 2299: if (!(UseSVE > 1 && VM_Version::supports_svebitperm())) { Do we need to test `UseSVE` here? src/hotspot/cpu/aarch64/aarch64.ad line 16958: > 16956: __ mov($tsrc$$FloatRegister, __ S, 0, $src$$Register); > 16957: __ mov($tmask$$FloatRegister, __ S, 0, $mask$$Register); > 16958: __ sve_bext($tdst$$FloatRegister, __ S, $tsrc$$FloatRegister, $tmask$$FloatRegister); The long latency of core <-> vector moves will be hurting us here. Loading operands from memory directly into vectors might help, as might an immediate form of the mask. 
------------- PR: https://git.openjdk.org/jdk/pull/10537 From smonteith at openjdk.org Tue Nov 1 09:24:30 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Tue, 1 Nov 2022 09:24:30 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 08:46:55 GMT, Andrew Haley wrote: >> Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision: >> >> Update src/hotspot/cpu/aarch64/aarch64.ad >> >> Correct slight formatting error. >> >> Co-authored-by: Eric Liu > > src/hotspot/cpu/aarch64/aarch64.ad line 2299: > >> 2297: case Op_ExpandBits: >> 2298: case Op_CompressBits: >> 2299: if (!(UseSVE > 1 && VM_Version::supports_svebitperm())) { > > Do we need to test `UseSVE` here? We don't want to enable this on detection of the feature - if UseSVE is disable, we'll be missing the SVE infrastructure that we are dependent on in the compiler. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From smonteith at openjdk.org Tue Nov 1 09:51:35 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Tue, 1 Nov 2022 09:51:35 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 09:20:31 GMT, Stuart Monteith wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 2299: >> >>> 2297: case Op_ExpandBits: >>> 2298: case Op_CompressBits: >>> 2299: if (!(UseSVE > 1 && VM_Version::supports_svebitperm())) { >> >> Do we need to test `UseSVE` here? > > We don't want to enable this on detection of the feature - if UseSVE is disable, we'll be missing the SVE infrastructure that we are dependent on in the compiler. Correction - we don't want to enable this solely on the present of the feature. With -XX:-UseSVE on the command line, this would break, so we need UseSVE and the feature to be present. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From eosterlund at openjdk.org Tue Nov 1 09:52:59 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Tue, 1 Nov 2022 09:52:59 GMT Subject: RFR: 8296101: nmethod::is_unloading result unstable with concurrent unloading Message-ID: If an nmethod is not called during a concurrent full GC, then after marking has terminated, multiple threads can call is_unloading. If at the same time, the nmethod is made not_entrant, then we run into a source of instability in the is_cold() calculation used when computing is_unloading. There we check if the nmethod is_not_entrant(), which some concurrent observers will think is true, while others think it's false. The current code that sets the is_unloading_state in is_unloading() assumes that the computed state is the same across all observers. However, that is no longer true. I propose to set the is_unloading_state with a CAS instead of plain store. Then, as is_unloading() is computed before making nmethods not_entrant, we can guarantee that all concurrent readers of is_unloading in this scenario will return false in the current unloading cycle, instead of racingly returning either false or true. One thread wins, and it will say false, and the other threads will compute conflicting results, but end up agreeing after the CAS, that they should all return false. Tested with mach5 tier1-7. Also tried replacing the is_not_entrant() ingredient in is_cold with os::random() to simulate the instability source. Without my fix RunThese crashes almost immediately, and with my fix it doesn't crash. 
------------- Commit messages: - 8296101: nmethod::is_unloading result unstable with concurrent unloading Changes: https://git.openjdk.org/jdk/pull/10926/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10926&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296101 Stats: 16 lines in 1 file changed: 11 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10926.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10926/head:pull/10926 PR: https://git.openjdk.org/jdk/pull/10926 From kbarrett at openjdk.org Tue Nov 1 09:58:57 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 09:58:57 GMT Subject: RFR: 8296161: [aarch64] Remove unused "pcrel" addressing mode tag Message-ID: Please review this trivial change to remove the unused Address::pcrel enumerator from the Address::mode enum. ------------- Commit messages: - remove unused pcrel mode Changes: https://git.openjdk.org/jdk/pull/10927/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10927&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296161 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10927.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10927/head:pull/10927 PR: https://git.openjdk.org/jdk/pull/10927 From kbarrett at openjdk.org Tue Nov 1 10:02:04 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 10:02:04 GMT Subject: RFR: 8296162: [aarch64] Remove unused Address::_is_lval Message-ID: Please review this trivial change to remove the unused Address::_is_lval member. ------------- Commit messages: - remove unused _is_lval Changes: https://git.openjdk.org/jdk/pull/10928/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10928&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296162 Stats: 7 lines in 2 files changed: 0 ins; 7 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10928.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10928/head:pull/10928 PR: https://git.openjdk.org/jdk/pull/10928 From kbarrett at openjdk.org Tue Nov 1 10:28:38 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 10:28:38 GMT Subject: RFR: 8296163: [aarch64] Cleanup Pre/Post addressing mode classes Message-ID: Please review this cleanup of the aarch64 Pre, Post, and PrePost addressing mode helper classes. The special functions for the PrePost class are changed from public to protected, to ensure no slicing is possible. In the Post class constructors, initialization of the members declared directly in that class is now performed in the ctor-initializer rather than by assignments in the body. The member reader functions in PrePost and Post are now const. 
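The "first CAS wins and everyone returns the same answer" idea is easier to see outside HotSpot; a minimal Java sketch of the same idiom (hypothetical names, with java.util.concurrent atomics standing in for the VM's Atomic::cmpxchg):

    import java.util.concurrent.atomic.AtomicInteger;

    // Many threads may racily compute different candidate values, but the first
    // successful CAS fixes the published answer; losers adopt the winner's value.
    class StickyCachedState {
        private static final int UNKNOWN = -1;
        private final AtomicInteger state = new AtomicInteger(UNKNOWN);

        int query(int racilyComputedCandidate) {
            int cur = state.get();
            if (cur != UNKNOWN) {
                return cur;                   // already decided, stays stable
            }
            int witness = state.compareAndExchange(UNKNOWN, racilyComputedCandidate);
            return (witness == UNKNOWN) ? racilyComputedCandidate : witness;
        }
    }

In the nmethod case the cached state is kept per unloading cycle rather than against a single UNKNOWN constant; the sketch only shows the agreement step.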
------------- Commit messages: - cleanup Pre/Post/PrePost Changes: https://git.openjdk.org/jdk/pull/10929/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10929&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296163 Stats: 12 lines in 1 file changed: 5 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/10929.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10929/head:pull/10929 PR: https://git.openjdk.org/jdk/pull/10929 From kbarrett at openjdk.org Tue Nov 1 10:44:08 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 10:44:08 GMT Subject: RFR: 8296101: nmethod::is_unloading result unstable with concurrent unloading In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 09:46:33 GMT, Erik ?sterlund wrote: > If an nmethod is not called during a concurrent full GC, then after marking has terminated, multiple threads can call is_unloading. If at the same time, the nmethod is made not_entrant, then we run into a source of instability in the is_cold() calculation used when computing is_unloading. There we check if the nmethod is_not_entrant(), which some concurrent observers will think is true, while others think it's false. > The current code that sets the is_unloading_state in is_unloading() assumes that the computed state is the same across all observers. However, that is no longer true. > > I propose to set the is_unloading_state with a CAS instead of plain store. Then, as is_unloading() is computed before making nmethods not_entrant, we can guarantee that all concurrent readers of is_unloading in this scenario will return false in the current unloading cycle, instead of racingly returning either false or true. One thread wins, and it will say false, and the other threads will compute conflicting results, but end up agreeing after the CAS, that they should all return false. > > Tested with mach5 tier1-7. Also tried replacing the is_not_entrant() ingredient in is_cold with os::random() to simulate the instability source. Without my fix RunThese crashes almost immediately, and with my fix it doesn't crash. Looks good as-is, but with a suggested alternate you can take or not. src/hotspot/share/code/nmethod.cpp line 1673: > 1671: // determined. This can't recurse as the second time we call the state is > 1672: // guaranteed to be cached for the current unloading cycle. > 1673: return is_unloading(); FWIW, it seems like you could remove any question about recursion by doing something like this: uint8_t found_state = Atomic::cmpxchg(&_is_unloading_state, state, new_state, memory_order_relaxed); uint8_t updated_state = (found_state == state) ? new_state : found_state; return IsUnloadingState::is_unloading(updated_state); Is that clearer (esp. with corresponding commentary)? Maybe. ------------- Marked as reviewed by kbarrett (Reviewer). PR: https://git.openjdk.org/jdk/pull/10926 From chagedorn at openjdk.org Tue Nov 1 10:44:25 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 1 Nov 2022 10:44:25 GMT Subject: RFR: 8296161: [aarch64] Remove unused "pcrel" addressing mode tag In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 09:50:36 GMT, Kim Barrett wrote: > Please review this trivial change to remove the unused Address::pcrel > enumerator from the Address::mode enum. Looks good and trivial! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10927 From adinn at openjdk.org Tue Nov 1 10:57:28 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 1 Nov 2022 10:57:28 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 08:48:10 GMT, Andrew Haley wrote: >> Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision: >> >> Update src/hotspot/cpu/aarch64/aarch64.ad >> >> Correct slight formatting error. >> >> Co-authored-by: Eric Liu > > src/hotspot/cpu/aarch64/aarch64.ad line 2294: > >> 2292: case Op_CacheWBPostSync: >> 2293: if (!VM_Version::supports_data_cache_line_flush()) { >> 2294: ret_value = false; > > Doesn't this stuff belong in aarch64_sve.ad? I'm not sure exactly what you are asking about here. However, if it is about the Op_CacheWB* checks then the answer is no. These are memory flush operations that are used by the code that syncs writes to persistent memory (NVRam) mapped buffers. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From chagedorn at openjdk.org Tue Nov 1 11:03:39 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 1 Nov 2022 11:03:39 GMT Subject: RFR: 8296163: [aarch64] Cleanup Pre/Post addressing mode classes In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 10:21:12 GMT, Kim Barrett wrote: > Please review this cleanup of the aarch64 Pre, Post, and PrePost addressing > mode helper classes. > > The special functions for the PrePost class are changed from public to > protected, to ensure no slicing is possible. > > In the Post class constructors, initialization of the members declared > directly in that class is now performed in the ctor-initializer rather than by > assignments in the body. > > The member reader functions in PrePost and Post are now const. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10929 From chagedorn at openjdk.org Tue Nov 1 11:04:31 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 1 Nov 2022 11:04:31 GMT Subject: RFR: 8296162: [aarch64] Remove unused Address::_is_lval In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 09:53:43 GMT, Kim Barrett wrote: > Please review this trivial change to remove the unused Address::_is_lval member. Looks good and trivial! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10928 From eliu at openjdk.org Tue Nov 1 11:12:48 2022 From: eliu at openjdk.org (Eric Liu) Date: Tue, 1 Nov 2022 11:12:48 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 09:49:06 GMT, Stuart Monteith wrote: >> We don't want to enable this on detection of the feature - if UseSVE is disable, we'll be missing the SVE infrastructure that we are dependent on in the compiler. > > Correction - we don't want to enable this solely on the present of the feature. With -XX:-UseSVE on the command line, this would break, so we need UseSVE and the feature to be present. My understanding: `VM_Version::supports_svebitperm()` means if the hardware supports the feature or not. `UseSVE` means if user wants to use SVE(2). 
------------- PR: https://git.openjdk.org/jdk/pull/10537 From smonteith at openjdk.org Tue Nov 1 11:12:48 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Tue, 1 Nov 2022 11:12:48 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: <-JIyxKdOEB6tVBISy2rs4yzKgy0CM58FVV8zRwBR00E=.c85c57c6-5a8f-48d0-9ebb-604a30fa63ce@github.com> On Tue, 1 Nov 2022 08:41:40 GMT, Andrew Haley wrote: >> Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision: >> >> Update src/hotspot/cpu/aarch64/aarch64.ad >> >> Correct slight formatting error. >> >> Co-authored-by: Eric Liu > > src/hotspot/cpu/aarch64/aarch64.ad line 16952: > >> 16950: format %{ "mov $tsrc, $src\n\t" >> 16951: "mov $tmask, $mask\n\t" >> 16952: "bext $tdst, $tsrc, $tmask\t# parallel bit extract\n\t" > > We don't need to tell readers what BEXT does.; I can remove that - it is just to match the equivalent x86 code. My actual preference would be to emit the method the intrinsic was replacing. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From aph at openjdk.org Tue Nov 1 11:29:26 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Nov 2022 11:29:26 GMT Subject: RFR: 8296161: [aarch64] Remove unused "pcrel" addressing mode tag In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 09:50:36 GMT, Kim Barrett wrote: > Please review this trivial change to remove the unused Address::pcrel > enumerator from the Address::mode enum. Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10927 From aph at openjdk.org Tue Nov 1 11:32:21 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Nov 2022 11:32:21 GMT Subject: RFR: 8296163: [aarch64] Cleanup Pre/Post addressing mode classes In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 10:21:12 GMT, Kim Barrett wrote: > Please review this cleanup of the aarch64 Pre, Post, and PrePost addressing > mode helper classes. > > The special functions for the PrePost class are changed from public to > protected, to ensure no slicing is possible. > > In the Post class constructors, initialization of the members declared > directly in that class is now performed in the ctor-initializer rather than by > assignments in the body. > > The member reader functions in PrePost and Post are now const. Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10929 From aph at openjdk.org Tue Nov 1 11:33:26 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Nov 2022 11:33:26 GMT Subject: RFR: 8296162: [aarch64] Remove unused Address::_is_lval In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 09:53:43 GMT, Kim Barrett wrote: > Please review this trivial change to remove the unused Address::_is_lval member. Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10928 From aph at openjdk.org Tue Nov 1 11:39:27 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Nov 2022 11:39:27 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 10:53:19 GMT, Andrew Dinn wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 2294: >> >>> 2292: case Op_CacheWBPostSync: >>> 2293: if (!VM_Version::supports_data_cache_line_flush()) { >>> 2294: ret_value = false; >> >> Doesn't this stuff belong in aarch64_sve.ad? > > I'm not sure exactly what you are asking about here. 
However, if it is about the Op_CacheWB* checks then the answer is no. These are memory flush operations that are used by the code that syncs writes to persistent memory (NVRam) mapped buffers. I'm talking about the SVE intrinsics. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From aph at openjdk.org Tue Nov 1 11:39:28 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Nov 2022 11:39:28 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 11:07:44 GMT, Eric Liu wrote: >> Correction - we don't want to enable this solely on the present of the feature. With -XX:-UseSVE on the command line, this would break, so we need UseSVE and the feature to be present. > > My understanding: > `VM_Version::supports_svebitperm()` means if the hardware supports the feature or not. `UseSVE` means if user wants to use SVE(2). OK, got it. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From thartmann at openjdk.org Tue Nov 1 12:03:59 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 1 Nov 2022 12:03:59 GMT Subject: RFR: 8276064: CheckCastPP with raw oop input floats below a safepoint Message-ID: This bug is similar to [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600): A CheckCastPP with a raw oop input floats out of a loop and below a safepoint. Since C2 does not generate OopMap entries for raw pointers, the GC will not update the oop if the corresponding object is moved during the safepoint. We either assert already during OopMap creation, or crash when dereferencing a stale oop during runtime (the verification code does not always detect such live raw oops at safepoints, I included a fix for that as well). I think the fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) is incomplete, because it only bails out of [PhaseIdealLoop::try_sink_out_of_loop](https://github.com/openjdk/jdk/commit/2ff4c01d42f1afcc53abd48e074356fb4a700754) while the underlying issue is that a raw CheckCastPP ends up with ctrl "far away" from its Allocate/Initialize and potentially even below a safepoint. Usually, the CheckCastPP would always be part of safepoint debug info and therefore late ctrl would be guaranteed to be above the safepoint. However, vector objects are aggressively scalar replaced in safepoints, which allows late ctrl to be set to further below. This is specific to vectors, since "normal" Java objects would either be fully scalarized or not be scalarized at all. In the failing case, Loop Unswitching clones the loop body and creates a Phi to merge the oop results from the vector allocations in both loops. Since ctrl of the CheckCastPP is outside of the loop, its data input is changed to the newly created Phi and its control input is set to the region that merges the loop exits. This moves the CheckCastPP below a safepoint in the loop. Below graphs show the details. `395 CheckCastPP` is removed from the debug info for `262 CallStaticJava` because it's scalarized (`326 SafepointScalarObject`). Late ctrl is then computed to be outside of the loop because the CheckCastPP is only used in the return. ![8276064_Before](https://user-images.githubusercontent.com/5312595/199204249-17564a59-2b67-4426-be71-19bc0eafac99.png) Now Loop Unswitching creates a `487 Region` and `517 Phi` to merge control and data inputs to the CheckCastPP from the fast and slow loops (see `PhaseIdealLoop::clone_loop_handle_data_uses`). Control of the `395 CheckCastPP` is updated accordingly. 
![8276064_After](https://user-images.githubusercontent.com/5312595/199204273-44341cd7-b5b6-4ec0-b8c9-6f349393dbd1.png) As a result, the raw oop input of the `395 CheckCastPP` is live at `262 CallStaticJava`. We could now add another point fix to prevent loop unswitching from moving the CheckCastPP out of the loop, but I think there is a risk that other current or future optimizations would rely on the CheckCastPP's late ctrl and do a similar thing. I would therefore suggest to pin all CheckCastPPs with a raw oop input, similar to what [JDK-5071820](https://bugs.openjdk.org/browse/JDK-5071820) did in GCM: https://github.com/openjdk/jdk/blob/37107fc1574a4191987420d88f7182e63c7da60c/src/hotspot/share/opto/gcm.cpp#L1325-L1330 The fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) can then be reverted, also because Roland's fix for [JDK-8272562](https://bugs.openjdk.org/browse/JDK-8272562) disabled moving **all** CheckCastPPs out of loops anyway. Roland said that he plans to revisit that decision with [JDK-8275202](https://bugs.openjdk.org/browse/JDK-8275202). The tests added with this PR will cover the `PhaseIdealLoop::try_sink_out_of_loop` case as well and therefore serve as regression tests for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600). We could improve this by adding logic to set late ctrl just above the safepoint, but I'm not sure if it's worth the complexity because we would need to walk up the control paths from late to early control and compute the dominator of all safepoints. I also fixed the verification code in `OopFlow::build_oop_map` to account for spilling. Before, compilation of `test1` would pass and only crash during execution. Now, we assert and print: 454 DefinitionSpillCopy === _ 122 [[ 321 ]] !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) 321 Phi === 315 454 503 [[ 512 ]] #rawptr:NotNull !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) 38 CallStaticJavaDirect === 40 126 136 102 0 455 452 138 139 453 460 [[ 39 84 130 37 388 ]] Static compiler.vectorapi.TestRawOopAtSafepoint::safepoint # void ( int ) TestRawOopAtSafepoint::test1 @ bci:44 (line 75) !jvms: TestRawOopAtSafepoint::test1 @ bci:44 (line 75) What do you think? Thanks, Tobias ------------- Commit messages: - Fixed verification code - 8276064: CheckCastPP with raw oop input floats below a safepoint Changes: https://git.openjdk.org/jdk/pull/10932/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10932&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8276064 Stats: 154 lines in 4 files changed: 146 ins; 4 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/10932.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10932/head:pull/10932 PR: https://git.openjdk.org/jdk/pull/10932 From smonteith at openjdk.org Tue Nov 1 12:08:52 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Tue, 1 Nov 2022 12:08:52 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: <2iWMACcRXODgZh4RMCaJgucokFeMmUaeYEzfWLDTUc4=.a7aa150d-67ed-4997-98ac-dab07f220591@github.com> On Tue, 1 Nov 2022 11:35:09 GMT, Andrew Haley wrote: >> I'm not sure exactly what you are asking about here. However, if it is about the Op_CacheWB* checks then the answer is no. These are memory flush operations that are used by the code that syncs writes to persistent memory (NVRam) mapped buffers. > > I'm talking about the SVE intrinsics. 
@theRealAph just means the cases for op_ExpandBits/op_CompressBits. aarch64_sve.ad was merged with aarch64_neon.ad into aarch64_vector.ad, but while I'm using SVE instructions, I'm not using SVE types. match_rule_supported_vector isn't called by the scalar compiler code for intrinsics, and so we'd be deviating further from the common code. I'm reluctant to move the rules into aarch64_vector.ad file, as that would separate it from the match_rule_supported code that enables it - placing it among vector code doesn't really match the intent behind the code using vector instructions to perform scalar operations. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From smonteith at openjdk.org Tue Nov 1 12:08:53 2022 From: smonteith at openjdk.org (Stuart Monteith) Date: Tue, 1 Nov 2022 12:08:53 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: References: Message-ID: <8BmzwyLqeeJe72OZZJLSN-WAFK2HLru6QSyVv7glMdM=.290fd661-1122-4a69-bc51-de3000564645@github.com> On Tue, 1 Nov 2022 08:45:50 GMT, Andrew Haley wrote: >> Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision: >> >> Update src/hotspot/cpu/aarch64/aarch64.ad >> >> Correct slight formatting error. >> >> Co-authored-by: Eric Liu > > src/hotspot/cpu/aarch64/aarch64.ad line 16958: > >> 16956: __ mov($tsrc$$FloatRegister, __ S, 0, $src$$Register); >> 16957: __ mov($tmask$$FloatRegister, __ S, 0, $mask$$Register); >> 16958: __ sve_bext($tdst$$FloatRegister, __ S, $tsrc$$FloatRegister, $tmask$$FloatRegister); > > The long latency of core <-> vector moves will be hurting us here. Loading operands from memory directly into vectors might help, as might an immediate form of the mask. I'll try this out - I imagine we could specialise this in the way you suggest. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From rkennke at openjdk.org Tue Nov 1 12:16:48 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 1 Nov 2022 12:16:48 GMT Subject: RFR: 8296136: Use correct register in aarch64_enc_fast_unlock() [v2] In-Reply-To: References: Message-ID: > In aarch64_enc_fast_unlock() (aarch64.ad) we have this piece of code: > > > __ ldr(tmp, Address(oop, oopDesc::mark_offset_in_bytes())); > __ tbnz(disp_hdr, exact_log2(markWord::monitor_value), object_has_monitor); > > > The tbnz uses the wrong register - it should really use tmp. disp_hdr has been loaded with the displaced header of the stack-lock, which would never have its monitor bits set, thus the branch will always take the slow path. In this common case, it is only a performance nuisance. In the case of !UseHeavyMonitors it is even worse, then disp_hdr will be unitialized, and we are facing a correctness problem. > > As far as I can tell, the problem dates back to when aarch64 C2 parts have been added to OpenJDK. 
> > Testing: > - [x] tier1 > - [x] tier2 > - [x] tier3 > - [x] tier4 Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Same fix for RISC-V ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10921/files - new: https://git.openjdk.org/jdk/pull/10921/files/0b8cfea5..832dbb06 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10921&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10921&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10921.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10921/head:pull/10921 PR: https://git.openjdk.org/jdk/pull/10921 From rkennke at openjdk.org Tue Nov 1 12:16:48 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 1 Nov 2022 12:16:48 GMT Subject: RFR: 8296136: Use correct register in aarch64_enc_fast_unlock() In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 06:52:31 GMT, Fei Yang wrote: > Could you please also incorporate following fix for RISC-V at the same time? I see it inherits the same similar issue here. This has passed tier1 test on HiFive Unmatched board. Thanks. > > ``` > diff --git a/src/hotspot/cpu/riscv/riscv.ad b/src/hotspot/cpu/riscv/riscv.ad > index 75612ef7508..abe0f609a62 100644 > --- a/src/hotspot/cpu/riscv/riscv.ad > +++ b/src/hotspot/cpu/riscv/riscv.ad > @@ -2474,7 +2474,7 @@ encode %{ > > // Handle existing monitor. > __ ld(tmp, Address(oop, oopDesc::mark_offset_in_bytes())); > - __ andi(t0, disp_hdr, markWord::monitor_value); > + __ andi(t0, tmp, markWord::monitor_value); > __ bnez(t0, object_has_monitor); > > if (!UseHeavyMonitors) { > ``` I pushed the proposed fix for RISC-V. Could you please give it a quick build and smoke test, and approve the PR? Then I'd integrate it. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10921 From fyang at openjdk.org Tue Nov 1 12:34:28 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 1 Nov 2022 12:34:28 GMT Subject: RFR: 8296136: Use correct register in aarch64_enc_fast_unlock() [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 12:16:48 GMT, Roman Kennke wrote: >> In aarch64_enc_fast_unlock() (aarch64.ad) we have this piece of code: >> >> >> __ ldr(tmp, Address(oop, oopDesc::mark_offset_in_bytes())); >> __ tbnz(disp_hdr, exact_log2(markWord::monitor_value), object_has_monitor); >> >> >> The tbnz uses the wrong register - it should really use tmp. disp_hdr has been loaded with the displaced header of the stack-lock, which would never have its monitor bits set, thus the branch will always take the slow path. In this common case, it is only a performance nuisance. In the case of !UseHeavyMonitors it is even worse, then disp_hdr will be unitialized, and we are facing a correctness problem. >> >> As far as I can tell, the problem dates back to when aarch64 C2 parts have been added to OpenJDK. >> >> Testing: >> - [x] tier1 >> - [x] tier2 >> - [x] tier3 >> - [x] tier4 > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > Same fix for RISC-V Marked as reviewed by fyang (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/10921 From fyang at openjdk.org Tue Nov 1 12:34:29 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 1 Nov 2022 12:34:29 GMT Subject: RFR: 8296136: Use correct register in aarch64_enc_fast_unlock() In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 12:13:27 GMT, Roman Kennke wrote: > > Could you please also incorporate following fix for RISC-V at the same time? I see it inherits the same similar issue here. This has passed tier1 test on HiFive Unmatched board. Thanks. > > ``` > > diff --git a/src/hotspot/cpu/riscv/riscv.ad b/src/hotspot/cpu/riscv/riscv.ad > > index 75612ef7508..abe0f609a62 100644 > > --- a/src/hotspot/cpu/riscv/riscv.ad > > +++ b/src/hotspot/cpu/riscv/riscv.ad > > @@ -2474,7 +2474,7 @@ encode %{ > > > > // Handle existing monitor. > > __ ld(tmp, Address(oop, oopDesc::mark_offset_in_bytes())); > > - __ andi(t0, disp_hdr, markWord::monitor_value); > > + __ andi(t0, tmp, markWord::monitor_value); > > __ bnez(t0, object_has_monitor); > > > > if (!UseHeavyMonitors) { > > ``` > > I pushed the proposed fix for RISC-V. Could you please give it a quick build and smoke test, and approve the PR? Then I'd integrate it. Thanks! Yes, my local tests looks good. I think we are ready to go. Thanks again. ------------- PR: https://git.openjdk.org/jdk/pull/10921 From erikj at openjdk.org Tue Nov 1 12:47:29 2022 From: erikj at openjdk.org (Erik Joelsson) Date: Tue, 1 Nov 2022 12:47:29 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v2] In-Reply-To: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> References: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> Message-ID: On Fri, 28 Oct 2022 07:19:31 GMT, Jie Fu wrote: >> Hi all, >> >> As discussed here https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it would be better to add the vector api tests in GHA. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add jdk_vector_sanity test group > - Merge branch 'master' into JDK-8295970 > - Revert changes in test.yml > - 8295970: Add jdk_vector tests in GHA This looks good to me. The group tier1_part3 looks like it's currently the fastest of the 3, so it's the right one to add to. ------------- Marked as reviewed by erikj (Reviewer). PR: https://git.openjdk.org/jdk/pull/10879 From mdoerr at openjdk.org Tue Nov 1 13:24:12 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 1 Nov 2022 13:24:12 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic Message-ID: This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. 
------------- Commit messages: - 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic Changes: https://git.openjdk.org/jdk/pull/10933/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295724 Stats: 12 lines in 2 files changed: 11 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From aph at openjdk.org Tue Nov 1 14:35:17 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Nov 2022 14:35:17 GMT Subject: RFR: 8294194: [AArch64] Create intrinsics compress and expand [v2] In-Reply-To: <2iWMACcRXODgZh4RMCaJgucokFeMmUaeYEzfWLDTUc4=.a7aa150d-67ed-4997-98ac-dab07f220591@github.com> References: <2iWMACcRXODgZh4RMCaJgucokFeMmUaeYEzfWLDTUc4=.a7aa150d-67ed-4997-98ac-dab07f220591@github.com> Message-ID: <7LyfJnbqznuJcRX0y0-BxH_QaUh7IMLXPwsVSRWuScg=.6e52f634-6e4a-45f0-9a50-8cd1d044ca1f@github.com> On Tue, 1 Nov 2022 12:05:38 GMT, Stuart Monteith wrote: >> I'm talking about the SVE intrinsics. > > @theRealAph just means the cases for op_ExpandBits/op_CompressBits. > aarch64_sve.ad was merged with aarch64_neon.ad into aarch64_vector.ad, but while I'm using SVE instructions, I'm not using SVE types. match_rule_supported_vector isn't called by the scalar compiler code for intrinsics, and so we'd be deviating further from the common code. I'm reluctant to move the rules into aarch64_vector.ad file, as that would separate it from the match_rule_supported code that enables it - placing it among vector code doesn't really match the intent behind the code using vector instructions to perform scalar operations. Hmm, interesting. Seems a bit odd, but OK. I guess I have to admit that a bunch of code that's not vectors uses the vector uint. ------------- PR: https://git.openjdk.org/jdk/pull/10537 From chagedorn at openjdk.org Tue Nov 1 15:42:48 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 1 Nov 2022 15:42:48 GMT Subject: RFR: 8280378: [IR Framework] Support IR matching for different compile phases [v6] In-Reply-To: <7eV6OcVY0w8MzR-qUTg4glsxSQ3ig5OkJ8ymwhqvmlc=.716d2c2b-fc10-462f-ae37-14335dafff84@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> <7eV6OcVY0w8MzR-qUTg4glsxSQ3ig5OkJ8ymwhqvmlc=.716d2c2b-fc10-462f-ae37-14335dafff84@github.com> Message-ID: On Mon, 31 Oct 2022 08:26:25 GMT, Christian Hagedorn wrote: >> This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: >> >> https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 >> >> The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). A matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates an own `CompilePhase` enum entry for them to simplify the implementation. >> >> ## How does it work? 
>> >> ### Basic idea >> There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: >> >> >> int iFld; >> >> @Test >> @IR(counts = {IRNode.STORE_I, "1"}, >> phase = {CompilePhase.AFTER_PARSING, // Fails >> CompilePhase.ITER_GVN1}) // Works >> public void optimizeStores() { >> iFld = 42; >> iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 >> } >> >> In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: >> >> 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: >> * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" >> > Phase "After Parsing": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" >> - Failed comparison: [found] 2 = 1 [given] >> - Matched nodes (2): >> * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) >> * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) >> >> >> More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. >> >> ### CompilePhase.DEFAULT - default compile phase >> The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). >> >> Not only the existing IR tests but also the majority of future test do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. >> >> Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. >> >> ### Different regexes for the same IRNode entry >> A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: >> >> - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". 
These strings are just encodings recognized by the IR framework as IR nodes: >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node >> public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node >> >> - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such als `allocNodes()`): >> >> public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; >> static { >> String idealIndependentRegex = START + "Allocate" + MID + END; >> String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; >> allocNodes(ALLOC, idealIndependentRegex, optoRegex); >> } >> >> **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** >> >> ### Using the IRNode entries correctly >> The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: >> - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). >> - An IR node entry is used without a specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). >> - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. >> >> ## General Changes >> The patch became larger than I've intended it to be. I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: >> >> - Added more packages to better group related classes together. >> - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order how they are used throughout a compilation. >> - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). >> - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) >> - Cleaned up and refactored a lot of code to use this new design. >> - Using visitor pattern to visit the `MatchResult` objects. 
This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. >> - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. >> - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. >> - The failure message and compilation output is sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. >> - Replaced implementation inheritance by interfaces. >> - Improved encapsulation of object data. >> - Updated README and many comments/class descriptions to reflect this new feature. >> - Added new IR framework tests >> >> ## Testing >> - Normal tier testing. >> - Applying the patch to Valhalla to perform tier testing. >> - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 85 commits: > > - Merge branch 'master' into JDK-8280378 > - Merge branch 'master' into JDK-8280378 > - Fix TestVectorConditionalMove > - Merge branch 'master' into JDK-8280378 > - Hao's patch to address review comments > - Roberto's review comments > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/NonIRTestClass.java > > Co-authored-by: Roberto Casta?eda Lozano > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/constraint/raw/RawConstraint.java > > Co-authored-by: Roberto Casta?eda Lozano > - Update test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/phase/CompilePhaseIRRuleBuilder.java > > Co-authored-by: Roberto Casta?eda Lozano > - Merge branch 'master' into JDK-8280378 > - ... and 75 more: https://git.openjdk.org/jdk/compare/9b9be88b...8d330790 Another test iteration was successful. I'm therefore integrating this now. Thanks all for your feedback again! 
------------- PR: https://git.openjdk.org/jdk/pull/10695 From chagedorn at openjdk.org Tue Nov 1 15:44:07 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 1 Nov 2022 15:44:07 GMT Subject: Integrated: 8280378: [IR Framework] Support IR matching for different compile phases In-Reply-To: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> References: <55-DBEUO9DmlD_ZbinJmkImXt4Wyllzs2Dt8e--O9IA=.0959234f-d224-43e0-8e61-422bacdade51@github.com> Message-ID: On Thu, 13 Oct 2022 12:00:42 GMT, Christian Hagedorn wrote: > This patch extends the IR framework with the capability to perform IR matching not only on the `PrintIdeal` and `PrintOptoAssembly` flag outputs but also on the `PrintIdeal` output of the different compile phases defined by the `COMPILER_PHASES` macro in `phasetype.hpp`: > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/opto/phasetype.hpp#L28-L29 > > The IR framework uses the same compile phases with the same names (it only drops those which are part of a normal execution like `PHASE_DEBUG` or `PHASE_FAILURE`). A matching on the `PrintIdeal` and `PrintOptoAssembly` flags is still possible. The IR framework creates an own `CompilePhase` enum entry for them to simplify the implementation. > > ## How does it work? > > ### Basic idea > There is a new `phase` attribute for `@IR` annotations that allows the user to specify a list of compile phases on which the IR rule should be applied on: > > > int iFld; > > @Test > @IR(counts = {IRNode.STORE_I, "1"}, > phase = {CompilePhase.AFTER_PARSING, // Fails > CompilePhase.ITER_GVN1}) // Works > public void optimizeStores() { > iFld = 42; > iFld = 42 + iFld; // Removed in first IGVN iteration and replaced by iFld = 84 > } > > In this example, we apply the IR rule on the compile phases `AFTER_PARSING` and `ITER_GVN1`. Since the first store to `iFld` is only removed in the first IGVN iteration, the IR rule fails for `AFTER_PARSING` while it passes for `ITER_GVN1`: > > 1) Method "public void ir_framework.examples.IRExample.optimizeStores()" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING, ITER_GVN1}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeatureOr={}, applyIfCPUFeature={}, counts={"_#STORE_I#_", "1"}, failOn={}, applyIfAnd={}, applyIfOr={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(StoreI.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > * 25 StoreI === 5 7 24 21 [[ 31 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:3 (line 123) > * 31 StoreI === 5 25 24 29 [[ 16 ]] @ir_framework/examples/IRExample+12 *, name=iFld, idx=4; Memory: @ir_framework/examples/IRExample:NotNull+12 *, name=iFld, idx=4; !jvms: IRExample::optimizeStores @ bci:14 (line 124) > > > More examples are shown in `IRExample.java` and also in the new test `TestPhaseIRMatching.java`. > > ### CompilePhase.DEFAULT - default compile phase > The existing IR tests either match on the `PrintIdeal` and/or `PrintOptoAssembly` flag. Looking closer at the individual `@IR` rules, we can see that a single regex only matches **either** on `PrintIdeal` **or** on `PrintOptoAssembly` but never on both. 
Most of these regexes are taken from the class `IRNode` which currently provides default regexes for various C2 IR nodes (a default regex either matches on `PrintIdeal` or `PrintOptoAssembly`). > > Not only the existing IR tests but also the majority of future test do not need this new flexibility - they simply want to default match on the `PrintIdeal` flag. To avoid having to specify `phase = CompilePhase.PRINT_IDEAL` for each new rule (and also to avoid updating the large number of existing IR tests), we introduce a new compile phase `CompilePhase.DEFAULT` which is used by default if the user does not specify the `phase` attribute. > > Each entry in class `IRNode` now needs to define a default phase such that the IR framework knows which compile phase an `IRNode` should be matched on if `CompilePhase.DEFAULT` is selected. > > ### Different regexes for the same IRNode entry > A regex for an IR node might look different for certain compile phases. For example, we can directly match the node name `Allocate` before macro expansion for an `Allocate` node. But we are only able to match allocations again on the `PrintOptoAssembly` output (the current option) which requires a different regex. Therefore, the `IRNode` class is redesigned in the following way: > > - `IRNode` entries which currently represent direct regexes are replaced by "IR node placeholder strings". These strings are just encodings recognized by the IR framework as IR nodes: > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; // Normal IR node > public static final String ALLOC_OF = COMPOSITE_PREFIX + "ALLOC_OF" + POSTFIX; // Composite IR node > > - For each _IR node placeholder string_, we need to define which regex should be used for which compile phase. Additionally, we need to specify the default compile phase for this IR node. This is done in a static block immediately following the IR node placeholder string where we add a mapping to `IRNode.IR_NODE_MAPPINGS` (by using helper methods such als `allocNodes()`): > > public static final String ALLOC = PREFIX + "ALLOC" + POSTFIX; > static { > String idealIndependentRegex = START + "Allocate" + MID + END; > String optoRegex = "(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java" + END; > allocNodes(ALLOC, idealIndependentRegex, optoRegex); > } > > **Thus, adding a new IR node now requires two steps: Defining an IR node placeholder string and an associated regex-compile-phase-mapping in the static block immediately following the IR node placeholder string. This mapping should reflect in which compile phases the new IR node can be found.** > > ### Using the IRNode entries correctly > The IR framework enforces the correct usage of the new IR node placeholder strings. It reports format violations if: > - An `IRNode` entry is used for a compile phase which is not supported (e.g. trying to match a `LoopNode` on the output of `CompilePhase.AFTER_PARSING`). > - An IR node entry is used without a specifying a mapping in a static block (e.g. when creating a new entry and forgetting to set up the mapping). > - Using a user-defined regex in `@IR` without specifying a non-default compile phase in `phase`. `CompilePhase.DEFAULT` only works with IR node placeholder strings. In all other cases, the user must explicitly specify one or more non-default compile phases. > > ## General Changes > The patch became larger than I've intended it to be. 
I've tried to split it into smaller patches but that was quite difficult and I eventually gave up. I therefore try to summarize the changes to make reviewing this simpler: > > - Added more packages to better group related classes together. > - Removed unused IGV phases in `phasetype.hpp` and sorted them in the order how they are used throughout a compilation. > - Updated existing, now failing, IR tests that use user-defined regexes. These now require a non-default phase. I've fixed that by replacing the user-defined regexes with `IRNode` entries (required some new `IRNode` entries for missing IR nodes). > - Introduced a new interface `Matchable.java` (for all classes representing a part on which IR matching can be done such as `IRMethod` or `IRRule`) to use it throughout the code instead of concrete class references. This allows better code sharing together with the already added `MatchResult.java` interface. Using interface types also simplifies testing as everything is substitutable. Each `Matchable`/`MatchResult` object now just contains a list of `Matchable`/`MatchResult` objects (e.g. an `IRMethod`/`IRMethodResult` contains a list of `IRRule`/`IRRuleMatchResult` objects to represent all IR rules/IR rule match results of this method etc.) > - Cleaned up and refactored a lot of code to use this new design. > - Using visitor pattern to visit the `MatchResult` objects. This simplifies the failure message and compilation output printing of the different compile phases. It allows us to perform specific operations at each level (i.e. on a method level, an IR rule level etc.). Visitors are also used in testing. > - New compile phase related classes in package `phase`. The idea is that each `IRRule`/`IRRuleMatchResult` class contains a list of `CompilePhaseIRRule`/`CompilePhaseIRRuleMatchResult` objects for each compile phase. When `CompilePhase.DEFAULT` is used, we create or re-use `CompilePhaseIRRule` objects which represent the specified default phases of each `IRNode` entry. > - The `FlagVM` now collects all the needed compile phases and creates a compiler directives file for the `TestVM` accordingly. We now only print the output which is also used for IR matching. Before this change, we always emitted `PrintIdeal` and `PrintOptoAssembly`. Even when not using the new matching on compile phases, we still get a performance benefit when only using `IRNode` entries which match on either `PrintIdeal` or `PrintOptoAssembly`. > - The failure message and compilation output is sorted alphabetically by method names and by the enum definition order of `CompilerPhase`. > - Replaced implementation inheritance by interfaces. > - Improved encapsulation of object data. > - Updated README and many comments/class descriptions to reflect this new feature. > - Added new IR framework tests > > ## Testing > - Normal tier testing. > - Applying the patch to Valhalla to perform tier testing. > - Testing with a ZGC branch by @robcasloz which does IR matching on mach graph compile phases. This requirement was the original motivation for this RFE. Thanks Roberto for helping to test this! > > Thanks, > Christian This pull request has now been integrated. 
Changeset: f829b5a7 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/f829b5a73f699ca7fc513f491f77daae6c8f4ed9 Stats: 9484 lines in 154 files changed: 7149 ins; 1598 del; 737 mod 8280378: [IR Framework] Support IR matching for different compile phases Reviewed-by: kvn, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/10695 From kvn at openjdk.org Tue Nov 1 16:15:24 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Nov 2022 16:15:24 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v2] In-Reply-To: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> References: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> Message-ID: <4W4-SzCjMp5ML_70PyjaweJPxaBaBZuT728K3OS4sJc=.5d1b3fec-051d-4b12-bd30-b839f2368c5c@github.com> On Fri, 28 Oct 2022 07:19:31 GMT, Jie Fu wrote: >> Hi all, >> >> As discussed here https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it would be better to add the vector api tests in GHA. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add jdk_vector_sanity test group > - Merge branch 'master' into JDK-8295970 > - Revert changes in test.yml > - 8295970: Add jdk_vector tests in GHA Good. I think we need review from @shipilev too since he had strong opinion about this. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10879 From kbarrett at openjdk.org Tue Nov 1 17:00:28 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 17:00:28 GMT Subject: RFR: 8296161: [aarch64] Remove unused "pcrel" addressing mode tag [v2] In-Reply-To: References: Message-ID: > Please review this trivial change to remove the unused Address::pcrel > enumerator from the Address::mode enum. Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into remove-pcrel - remove unused pcrel mode ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10927/files - new: https://git.openjdk.org/jdk/pull/10927/files/48bf39a6..21004cc1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10927&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10927&range=00-01 Stats: 9830 lines in 205 files changed: 7178 ins; 1732 del; 920 mod Patch: https://git.openjdk.org/jdk/pull/10927.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10927/head:pull/10927 PR: https://git.openjdk.org/jdk/pull/10927 From kbarrett at openjdk.org Tue Nov 1 17:00:29 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 17:00:29 GMT Subject: RFR: 8296161: [aarch64] Remove unused "pcrel" addressing mode tag [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 10:40:41 GMT, Christian Hagedorn wrote: >> Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into remove-pcrel >> - remove unused pcrel mode > > Looks good and trivial! Thanks @chhagedorn and @theRealAph for reviews. ------------- PR: https://git.openjdk.org/jdk/pull/10927 From kbarrett at openjdk.org Tue Nov 1 17:02:08 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 17:02:08 GMT Subject: Integrated: 8296161: [aarch64] Remove unused "pcrel" addressing mode tag In-Reply-To: References: Message-ID: <88wUc5J_wMwHzSIgHV7nADTuzWlw32C0McshWG8xqMM=.03ef1f5e-229f-4f90-b932-ea44202dffc0@github.com> On Tue, 1 Nov 2022 09:50:36 GMT, Kim Barrett wrote: > Please review this trivial change to remove the unused Address::pcrel > enumerator from the Address::mode enum. This pull request has now been integrated. Changeset: 15b8b451 Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/15b8b45178637acb07c33194f564acf807dfa5d4 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8296161: [aarch64] Remove unused "pcrel" addressing mode tag Reviewed-by: chagedorn, aph ------------- PR: https://git.openjdk.org/jdk/pull/10927 From kvn at openjdk.org Tue Nov 1 17:12:24 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Nov 2022 17:12:24 GMT Subject: RFR: 8276064: CheckCastPP with raw oop input floats below a safepoint In-Reply-To: References: Message-ID: <-lOGjSVxJWizo8DwigVYxZdDK6bciEapRZYwNSPigOg=.583bd3a0-b290-4c6e-89a6-5381f9525d5d@github.com> On Tue, 1 Nov 2022 11:55:04 GMT, Tobias Hartmann wrote: > This bug is similar to [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600): A CheckCastPP with a raw oop input floats out of a loop and below a safepoint. Since C2 does not generate OopMap entries for raw pointers, the GC will not update the oop if the corresponding object is moved during the safepoint. We either assert already during OopMap creation, or crash when dereferencing a stale oop during runtime (the verification code does not always detect such live raw oops at safepoints, I included a fix for that as well). > > I think the fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) is incomplete, because it only bails out of [PhaseIdealLoop::try_sink_out_of_loop](https://github.com/openjdk/jdk/commit/2ff4c01d42f1afcc53abd48e074356fb4a700754) while the underlying issue is that a raw CheckCastPP ends up with ctrl "far away" from its Allocate/Initialize and potentially even below a safepoint. Usually, the CheckCastPP would always be part of safepoint debug info and therefore late ctrl would be guaranteed to be above the safepoint. However, vector objects are aggressively scalar replaced in safepoints, which allows late ctrl to be set to further below. This is specific to vectors, since "normal" Java objects would either be fully scalarized or not be scalarized at all. > > In the failing case, Loop Unswitching clones the loop body and creates a Phi to merge the oop results from the vector allocations in both loops. Since ctrl of the CheckCastPP is outside of the loop, its data input is changed to the newly created Phi and its control input is set to the region that merges the loop exits. This moves the CheckCastPP below a safepoint in the loop. > > Below graphs show the details. `395 CheckCastPP` is removed from the debug info for `262 CallStaticJava` because it's scalarized (`326 SafepointScalarObject`). Late ctrl is then computed to be outside of the loop because the CheckCastPP is only used in the return. 
> > ![8276064_Before](https://user-images.githubusercontent.com/5312595/199204249-17564a59-2b67-4426-be71-19bc0eafac99.png) > > Now Loop Unswitching creates a `487 Region` and `517 Phi` to merge control and data inputs to the CheckCastPP from the fast and slow loops (see `PhaseIdealLoop::clone_loop_handle_data_uses`). Control of the `395 CheckCastPP` is updated accordingly. > > ![8276064_After](https://user-images.githubusercontent.com/5312595/199204273-44341cd7-b5b6-4ec0-b8c9-6f349393dbd1.png) > > As a result, the raw oop input of the `395 CheckCastPP` is live at `262 CallStaticJava`. > > We could now add another point fix to prevent loop unswitching from moving the CheckCastPP out of the loop, but I think there is a risk that other current or future optimizations would rely on the CheckCastPP's late ctrl and do a similar thing. I would therefore suggest to pin all CheckCastPPs with a raw oop input, similar to what [JDK-5071820](https://bugs.openjdk.org/browse/JDK-5071820) did in GCM: > https://github.com/openjdk/jdk/blob/37107fc1574a4191987420d88f7182e63c7da60c/src/hotspot/share/opto/gcm.cpp#L1325-L1330 > > The fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) can then be reverted, also because Roland's fix for [JDK-8272562](https://bugs.openjdk.org/browse/JDK-8272562) disabled moving **all** CheckCastPPs out of loops anyway. Roland said that he plans to revisit that decision with [JDK-8275202](https://bugs.openjdk.org/browse/JDK-8275202). The tests added with this PR will cover the `PhaseIdealLoop::try_sink_out_of_loop` case as well and therefore serve as regression tests for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600). > > We could improve this by adding logic to set late ctrl just above the safepoint, but I'm not sure if it's worth the complexity because we would need to walk up the control paths from late to early control and compute the dominator of all safepoints. > > I also fixed the verification code in `OopFlow::build_oop_map` to account for spilling. Before, compilation of `test1` would pass and only crash during execution. Now, we assert and print: > > > 454 DefinitionSpillCopy === _ 122 [[ 321 ]] !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 321 Phi === 315 454 503 [[ 512 ]] #rawptr:NotNull !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 38 CallStaticJavaDirect === 40 126 136 102 0 455 452 138 139 453 460 [[ 39 84 130 37 388 ]] Static compiler.vectorapi.TestRawOopAtSafepoint::safepoint # void ( int ) TestRawOopAtSafepoint::test1 @ bci:44 (line 75) !jvms: TestRawOopAtSafepoint::test1 @ bci:44 (line 75) > > > > What do you think? > > Thanks, > Tobias Good and conservative solution. Thanks for discussion offline with Tobias and Vladimir I. about this special case from Vector API. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10932 From duke at openjdk.org Tue Nov 1 17:12:31 2022 From: duke at openjdk.org (Dhamoder Nalla) Date: Tue, 1 Nov 2022 17:12:31 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong In-Reply-To: References: Message-ID: On Wed, 28 Sep 2022 19:04:07 GMT, Dhamoder Nalla wrote: > https://bugs.openjdk.org/browse/JDK-8286800 > > assert(real_LCA != NULL) in dump_real_LCA is not appropriate in bad graph scenario when both wrong_lca & early nodes are start nodes > > jvm!PhaseIdealLoop::dump_real_LCA(): > // Walk the idom chain up from early and wrong_lca and stop when they intersect. > while (!n1->is_Start() && !n2->is_Start()) { > ... > } > assert(real_LCA != NULL, "must always find an LCA"); > > Fix: replace assert with a console message > Thanks @chhagedorn, please take this over as you are already familiar with this area. ------------- PR: https://git.openjdk.org/jdk/pull/10472 From kbarrett at openjdk.org Tue Nov 1 17:18:08 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 17:18:08 GMT Subject: RFR: 8296162: [aarch64] Remove unused Address::_is_lval [v2] In-Reply-To: References: Message-ID: <4AKm8_x3SKzRh-D1AnfE1PEJwnKxsSkShKaHXvi3QAM=.d6838294-eb33-42c2-af44-550cd958b19d@github.com> > Please review this trivial change to remove the unused Address::_is_lval member. Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into remove-is-lval - remove unused _is_lval ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10928/files - new: https://git.openjdk.org/jdk/pull/10928/files/9b2d7f4f..97030437 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10928&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10928&range=00-01 Stats: 9831 lines in 206 files changed: 7178 ins; 1732 del; 921 mod Patch: https://git.openjdk.org/jdk/pull/10928.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10928/head:pull/10928 PR: https://git.openjdk.org/jdk/pull/10928 From kbarrett at openjdk.org Tue Nov 1 17:18:10 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 17:18:10 GMT Subject: RFR: 8296162: [aarch64] Remove unused Address::_is_lval [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 11:02:15 GMT, Christian Hagedorn wrote: >> Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into remove-is-lval >> - remove unused _is_lval > > Looks good and trivial! Thanks @chhagedorn and @theRealAph for reviews. ------------- PR: https://git.openjdk.org/jdk/pull/10928 From kbarrett at openjdk.org Tue Nov 1 17:19:51 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Nov 2022 17:19:51 GMT Subject: Integrated: 8296162: [aarch64] Remove unused Address::_is_lval In-Reply-To: References: Message-ID: <3wJXhzEb0QqFTyZ1UYjTrMYa13FFkbARzAdVM6KwKos=.15f3ed07-dd77-4e54-8e7d-ce32a427fe17@github.com> On Tue, 1 Nov 2022 09:53:43 GMT, Kim Barrett wrote: > Please review this trivial change to remove the unused Address::_is_lval member. This pull request has now been integrated. 
Changeset: 2fb64a4a Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/2fb64a4a4fd650e8767bb9959dc53f8c450d4060 Stats: 7 lines in 2 files changed: 0 ins; 7 del; 0 mod 8296162: [aarch64] Remove unused Address::_is_lval Reviewed-by: chagedorn, aph ------------- PR: https://git.openjdk.org/jdk/pull/10928 From vlivanov at openjdk.org Tue Nov 1 17:44:26 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 1 Nov 2022 17:44:26 GMT Subject: RFR: 8276064: CheckCastPP with raw oop input floats below a safepoint In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 11:55:04 GMT, Tobias Hartmann wrote: > This bug is similar to [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600): A CheckCastPP with a raw oop input floats out of a loop and below a safepoint. Since C2 does not generate OopMap entries for raw pointers, the GC will not update the oop if the corresponding object is moved during the safepoint. We either assert already during OopMap creation, or crash when dereferencing a stale oop during runtime (the verification code does not always detect such live raw oops at safepoints, I included a fix for that as well). > > I think the fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) is incomplete, because it only bails out of [PhaseIdealLoop::try_sink_out_of_loop](https://github.com/openjdk/jdk/commit/2ff4c01d42f1afcc53abd48e074356fb4a700754) while the underlying issue is that a raw CheckCastPP ends up with ctrl "far away" from its Allocate/Initialize and potentially even below a safepoint. Usually, the CheckCastPP would always be part of safepoint debug info and therefore late ctrl would be guaranteed to be above the safepoint. However, vector objects are aggressively scalar replaced in safepoints, which allows late ctrl to be set to further below. This is specific to vectors, since "normal" Java objects would either be fully scalarized or not be scalarized at all. > > In the failing case, Loop Unswitching clones the loop body and creates a Phi to merge the oop results from the vector allocations in both loops. Since ctrl of the CheckCastPP is outside of the loop, its data input is changed to the newly created Phi and its control input is set to the region that merges the loop exits. This moves the CheckCastPP below a safepoint in the loop. > > Below graphs show the details. `395 CheckCastPP` is removed from the debug info for `262 CallStaticJava` because it's scalarized (`326 SafepointScalarObject`). Late ctrl is then computed to be outside of the loop because the CheckCastPP is only used in the return. > > ![8276064_Before](https://user-images.githubusercontent.com/5312595/199204249-17564a59-2b67-4426-be71-19bc0eafac99.png) > > Now Loop Unswitching creates a `487 Region` and `517 Phi` to merge control and data inputs to the CheckCastPP from the fast and slow loops (see `PhaseIdealLoop::clone_loop_handle_data_uses`). Control of the `395 CheckCastPP` is updated accordingly. > > ![8276064_After](https://user-images.githubusercontent.com/5312595/199204273-44341cd7-b5b6-4ec0-b8c9-6f349393dbd1.png) > > As a result, the raw oop input of the `395 CheckCastPP` is live at `262 CallStaticJava`. > > We could now add another point fix to prevent loop unswitching from moving the CheckCastPP out of the loop, but I think there is a risk that other current or future optimizations would rely on the CheckCastPP's late ctrl and do a similar thing. 
I would therefore suggest to pin all CheckCastPPs with a raw oop input, similar to what [JDK-5071820](https://bugs.openjdk.org/browse/JDK-5071820) did in GCM: > https://github.com/openjdk/jdk/blob/37107fc1574a4191987420d88f7182e63c7da60c/src/hotspot/share/opto/gcm.cpp#L1325-L1330 > > The fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) can then be reverted, also because Roland's fix for [JDK-8272562](https://bugs.openjdk.org/browse/JDK-8272562) disabled moving **all** CheckCastPPs out of loops anyway. Roland said that he plans to revisit that decision with [JDK-8275202](https://bugs.openjdk.org/browse/JDK-8275202). The tests added with this PR will cover the `PhaseIdealLoop::try_sink_out_of_loop` case as well and therefore serve as regression tests for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600). > > We could improve this by adding logic to set late ctrl just above the safepoint, but I'm not sure if it's worth the complexity because we would need to walk up the control paths from late to early control and compute the dominator of all safepoints. > > I also fixed the verification code in `OopFlow::build_oop_map` to account for spilling. Before, compilation of `test1` would pass and only crash during execution. Now, we assert and print: > > > 454 DefinitionSpillCopy === _ 122 [[ 321 ]] !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 321 Phi === 315 454 503 [[ 512 ]] #rawptr:NotNull !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 38 CallStaticJavaDirect === 40 126 136 102 0 455 452 138 139 453 460 [[ 39 84 130 37 388 ]] Static compiler.vectorapi.TestRawOopAtSafepoint::safepoint # void ( int ) TestRawOopAtSafepoint::test1 @ bci:44 (line 75) !jvms: TestRawOopAtSafepoint::test1 @ bci:44 (line 75) > > > > What do you think? > > Thanks, > Tobias Looks good. Thanks for fixing the issue! ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/10932 From kvn at openjdk.org Tue Nov 1 18:15:31 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Nov 2022 18:15:31 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 13:13:46 GMT, Martin Doerr wrote: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Is it possible to test this case possibly with small CodeCache sizes? ------------- PR: https://git.openjdk.org/jdk/pull/10933 From dlong at openjdk.org Tue Nov 1 18:36:25 2022 From: dlong at openjdk.org (Dean Long) Date: Tue, 1 Nov 2022 18:36:25 GMT Subject: RFR: 8296101: nmethod::is_unloading result unstable with concurrent unloading In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 10:39:56 GMT, Kim Barrett wrote: >> If an nmethod is not called during a concurrent full GC, then after marking has terminated, multiple threads can call is_unloading. If at the same time, the nmethod is made not_entrant, then we run into a source of instability in the is_cold() calculation used when computing is_unloading. There we check if the nmethod is_not_entrant(), which some concurrent observers will think is true, while others think it's false. 
>> The current code that sets the is_unloading_state in is_unloading() assumes that the computed state is the same across all observers. However, that is no longer true. >> >> I propose to set the is_unloading_state with a CAS instead of plain store. Then, as is_unloading() is computed before making nmethods not_entrant, we can guarantee that all concurrent readers of is_unloading in this scenario will return false in the current unloading cycle, instead of racingly returning either false or true. One thread wins, and it will say false, and the other threads will compute conflicting results, but end up agreeing after the CAS, that they should all return false. >> >> Tested with mach5 tier1-7. Also tried replacing the is_not_entrant() ingredient in is_cold with os::random() to simulate the instability source. Without my fix RunThese crashes almost immediately, and with my fix it doesn't crash. > > src/hotspot/share/code/nmethod.cpp line 1673: > >> 1671: // determined. This can't recurse as the second time we call the state is >> 1672: // guaranteed to be cached for the current unloading cycle. >> 1673: return is_unloading(); > > FWIW, it seems like you could remove any question about recursion by doing > something like this: > > uint8_t found_state = Atomic::cmpxchg(&_is_unloading_state, state, new_state, memory_order_relaxed); > uint8_t updated_state = (found_state == state) ? new_state : found_state; > return IsUnloadingState::is_unloading(updated_state); > > Is that clearer (esp. with corresponding commentary)? Maybe. I think avoiding recursion is clearer. How about: if (found_state == state) { // First to change state, we win return state_is_unloading; } else { // State already set, so use it return IsUnloadingState::is_unloading(found_state); } ------------- PR: https://git.openjdk.org/jdk/pull/10926 From vlivanov at openjdk.org Tue Nov 1 23:34:34 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 1 Nov 2022 23:34:34 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v6] In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 20:39:44 GMT, vpaprotsk wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 
100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: > > invalidkeyexception and some review comments src/hotspot/cpu/x86/macroAssembler_x86.hpp line 970: > 968: > 969: void addmq(int disp, Register r1, Register r2); > 970: All Poly1305-related methods can be moved to `StubGenerator`. They are used solely during stub creation. src/hotspot/cpu/x86/macroAssembler_x86_poly.cpp line 32: > 30: #include "macroAssembler_x86.hpp" > 31: > 32: #ifdef _LP64 You could rename the file to `macroAssembler_x86_64_poly.cpp` and get rid of `#ifdef _LP64`. Once you move the declarations to `StubGenerator`, it'll be `stubGenerator_x86_64_poly.cpp`. src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2002: > 2000: } > 2001: > 2002: address StubGenerator::generate_poly1305_masksCP() { I suggest to turn it into a C++ literal constant and move the declaration next to `poly1305_process_blocks_avx512` where they are used. As an example, here's how it is handled in GHASH stubs: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64_ghash.cpp#L35 That would allow to avoid to simplify the code a bit (no need in `StubRoutines::x86::_poly1305_mask_addr`/`poly1305_mask_addr()` and no need to generate the constants during VM startup). You could split it into 3 constants, but then using a single base register (`polyCP`) won't work anymore. Thinking more about it, I'm not sure why you can't just do the split and use address literals instead to access individual constants (and repurpose `r13` to be used as a scratch register when RIP-relative addressing mode doesn't work). src/hotspot/share/runtime/globals.hpp line 241: > 239: "Use intrinsics for java.util.Base64") \ > 240: \ > 241: product(bool, UsePolyIntrinsics, false, \ I'm not a fan of introducing new flags for individual intrinsics (there's already `-XX:DisableIntrinsic=_name` specifically for that), but since we already have many, shouldn't it be declared as a diagnostic flag, at least? ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Tue Nov 1 23:34:34 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 1 Nov 2022 23:34:34 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: <4HxTb1DtD6KeuYupOKf32GoQ7SV8_EjHcqfhiZhbLHM=.884e631a-1336-454d-aae1-06f85f784381@github.com> On Fri, 28 Oct 2022 20:19:35 GMT, vpaprotsk wrote: > And just looking now on uops.info, they seem to have identical timings? Actual instruction being used (aligned vs unaligned versions) doesn't matter much here, because it's a dynamic property of the address being accessed: misaligned accesses that cross cache line boundary incur a penalty. Since the cache line size is 64 byte in size, every misaligned 512-bit access is penalized. 
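To make the alignment point concrete, here is a tiny check (an illustration, not code from the patch) showing why, with 64-byte cache lines, every misaligned 512-bit ZMM access necessarily spans two lines:

```c++
#include <cstdint>

// With 64-byte cache lines, a 64-byte (ZMM) load fits in a single line only
// when the address is 64-byte aligned; any other offset makes the access
// straddle two lines and pay the split penalty, so the test degenerates to
// a simple alignment check.
static inline bool zmm_access_splits_cache_line(uintptr_t addr) {
  return (addr & 63u) != 0;
}
```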
------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Tue Nov 1 23:51:27 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 1 Nov 2022 23:51:27 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v6] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 23:17:46 GMT, Vladimir Ivanov wrote: >> vpaprotsk has updated the pull request incrementally with one additional commit since the last revision: >> >> invalidkeyexception and some review comments > > src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2002: > >> 2000: } >> 2001: >> 2002: address StubGenerator::generate_poly1305_masksCP() { > > I suggest to turn it into a C++ literal constant and move the declaration next to `poly1305_process_blocks_avx512` where they are used. As an example, here's how it is handled in GHASH stubs: > https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64_ghash.cpp#L35 > > That would allow to avoid to simplify the code a bit (no need in `StubRoutines::x86::_poly1305_mask_addr`/`poly1305_mask_addr()` and no need to generate the constants during VM startup). > > You could split it into 3 constants, but then using a single base register (`polyCP`) won't work anymore. > Thinking more about it, I'm not sure why you can't just do the split and use address literals instead to access individual constants (and repurpose `r13` to be used as a scratch register when RIP-relative addressing mode doesn't work). The case of AES stubs may be even a better fit here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp#L47 It doesn't use/introduce any shared constants, so declaring a constant and a local accessor (to save on pointer to address casts at use sites) is enough. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From jiefu at openjdk.org Wed Nov 2 01:28:07 2022 From: jiefu at openjdk.org (Jie Fu) Date: Wed, 2 Nov 2022 01:28:07 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v2] In-Reply-To: References: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> Message-ID: On Fri, 28 Oct 2022 13:41:28 GMT, Erik Joelsson wrote: >> Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Add jdk_vector_sanity test group >> - Merge branch 'master' into JDK-8295970 >> - Revert changes in test.yml >> - 8295970: Add jdk_vector tests in GHA > > I think you need to add at least one other label than `build` to this now to make sure the right people can have a say in the change. Thanks @erikj79 and @vnkozlov for the review. So @shipilev , are you fine with this change? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10879 From duke at openjdk.org Wed Nov 2 02:38:26 2022 From: duke at openjdk.org (vpaprotsk) Date: Wed, 2 Nov 2022 02:38:26 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v6] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 23:49:17 GMT, Vladimir Ivanov wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2002: >> >>> 2000: } >>> 2001: >>> 2002: address StubGenerator::generate_poly1305_masksCP() { >> >> I suggest to turn it into a C++ literal constant and move the declaration next to `poly1305_process_blocks_avx512` where they are used. 
As an example, here's how it is handled in GHASH stubs: >> https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64_ghash.cpp#L35 >> >> That would allow to avoid to simplify the code a bit (no need in `StubRoutines::x86::_poly1305_mask_addr`/`poly1305_mask_addr()` and no need to generate the constants during VM startup). >> >> You could split it into 3 constants, but then using a single base register (`polyCP`) won't work anymore. >> Thinking more about it, I'm not sure why you can't just do the split and use address literals instead to access individual constants (and repurpose `r13` to be used as a scratch register when RIP-relative addressing mode doesn't work). > > The case of AES stubs may be even a better fit here: > https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp#L47 > > It doesn't use/introduce any shared constants, so declaring a constant and a local accessor (to save on pointer to address casts at use sites) is enough. I wonder if I can remove that function completely now.. Originally I kept those in memory, because I was rather tight on zmm registers (actually, all registers), and I could use the `Address` version of instructions to save a register.. But I had done a major cleanup on register allocation before pushing the PR, maybe there is room now. (But if we do want to bring back any of the optimizations I kept back, we would need those registers again.. but will see) PS: I am trying to address the 10% degradation @jnimeh and I discussed above, will take a few days to implement the latest round. Apologies for the delay and appreciate the review! ------------- PR: https://git.openjdk.org/jdk/pull/10582 From dongbo at openjdk.org Wed Nov 2 03:13:19 2022 From: dongbo at openjdk.org (Dong Bo) Date: Wed, 2 Nov 2022 03:13:19 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics Message-ID: In JDK-8252204, when implementing the SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`. However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`:

                  digest_length    block_size
    SHA3-224           28              144
    SHA3-256           32              136
    SHA3-384           48              104
    SHA3-512           64               72
    SHAKE128        variable           168
    SHAKE256        variable           136

This causes a SIGSEGV crash or a hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`. The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances. Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed. And tier1~3 passed on SHA3-supported hardware. The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform. The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively. Compared with the mainline code, the performance change with this patch is negligible on real hardware and on the simulation platform. Based on the JMH results for the SHA3 intrinsics, performance can be improved by ~50% on some hardware, while other hardware shows a ~30% regression. These performance details are available in the comments of the issue page. I guess the performance benefit of the SHA3 intrinsics depends on the microarchitecture, so it should be switched on/off based on the running platform.
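To spell out the arithmetic behind the table above, a small hedged sketch (illustrative only, not the patch itself): the Keccak state is 200 bytes, so the rate (block size) of the fixed-output instances can be derived from the digest length, but the XOF instances SHAKE128/SHAKE256 have fixed rates that are unrelated to the arbitrary requested output length, which is why the fix passes `block_size` explicitly.

```c++
// Holds only for the fixed-output instances SHA3-224/256/384/512:
// 200 - 2*28 = 144, 200 - 2*32 = 136, 200 - 2*48 = 104, 200 - 2*64 = 72.
static int sha3_block_size_from_digest(int digest_length) {
  return 200 - 2 * digest_length;
}

// SHAKE128 and SHAKE256 keep rates of 168 and 136 bytes no matter how many
// output bytes are requested, so the stub has to be told the block size
// instead of deriving it from digest_length.
static int shake_block_size(bool is_shake128) {
  return is_shake128 ? 168 : 136;
}
```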
------------- Commit messages: - Merge branch 'master' into 8295698-EdDSATest-crash - add some comments - 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics Changes: https://git.openjdk.org/jdk/pull/10939/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10939&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295698 Stats: 68 lines in 4 files changed: 18 ins; 13 del; 37 mod Patch: https://git.openjdk.org/jdk/pull/10939.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10939/head:pull/10939 PR: https://git.openjdk.org/jdk/pull/10939 From jbhateja at openjdk.org Wed Nov 2 03:19:04 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Nov 2022 03:19:04 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: <4HxTb1DtD6KeuYupOKf32GoQ7SV8_EjHcqfhiZhbLHM=.884e631a-1336-454d-aae1-06f85f784381@github.com> References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> <4HxTb1DtD6KeuYupOKf32GoQ7SV8_EjHcqfhiZhbLHM=.884e631a-1336-454d-aae1-06f85f784381@github.com> Message-ID: On Tue, 1 Nov 2022 23:04:45 GMT, Vladimir Ivanov wrote: >> Hmm.. interesting. Is this for loading? `evmovdquq` vs `evmovdqaq`? I was actually looking at using evmovdqaq but there is no encoding for it yet (And just looking now on uops.info, they seem to have identical timings? perhaps their measurements are off..). There are quite a few optimizations I tried (and removed) here, but not this one.. >> >> Perhaps to have a record, while its relatively fresh in my mind.. since there is a 8-block (I deleted a 16-block vector multiply), one can have a peeled off version for just 256 as the minimum payload.. In that case we only need R^1..R^8, (not R^1..R^16). I also tried loop stride of 8 blocks instead of 16, but that gets quite bit slower (20ish%?).. There was also a version that did a much better interleaving of multiplication and loading of next message block into limbs.. There is potentially a better way to 'devolve' the vector loop at tail; ie. when 15-blocks are left, just do one more 8-block multiply, all the constants are already available.. >> >> I removed all of those eventually. Even then, the assembler code currently is already fairly complex. The extra pre-, post-processing and if cases, I was struggling to keep up myself. Maybe code cleanup would have helped, so it _is_ possible to bring some of that back in for extra 10+%? (There is a branch on my fork with that code) >> >> I guess that's my long way of saying 'I don't want to complicate the assembler loop'? > >> And just looking now on uops.info, they seem to have identical timings? > > Actual instruction being used (aligned vs unaligned versions) doesn't matter much here, because it's a dynamic property of the address being accessed: misaligned accesses that cross cache line boundary incur a penalty. Since the cache line size is 64 byte in size, every misaligned 512-bit access is penalized. I collected performance counters for the benchmark included with the patch and its showing around 30% of 64 byte loads were spanning across the cache line. 
Performance counter stats for 'java -jar target/benchmarks.jar -f 1 -wi 1 -i 2 -w 30 -p dataSize=8192': 122385646614 cycles 328096538160 instructions # 2.68 insn per cycle 64530343063 MEM_INST_RETIRED.ALL_LOADS 22900705491 MEM_INST_RETIRED.ALL_STORES 19815558484 MEM_INST_RETIRED.SPLIT_LOADS 701176106 MEM_INST_RETIRED.SPLIT_STORES Presence of scalar peel loop before the vector loop can save this penalty. We should also extend the scope of optimization (preferably in this PR or in subsequent one) to optimize [MAC computation routine accepting ByteBuffer.](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java#L116), ------------- PR: https://git.openjdk.org/jdk/pull/10582 From thartmann at openjdk.org Wed Nov 2 06:10:21 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Nov 2022 06:10:21 GMT Subject: RFR: 8276064: CheckCastPP with raw oop input floats below a safepoint In-Reply-To: <-lOGjSVxJWizo8DwigVYxZdDK6bciEapRZYwNSPigOg=.583bd3a0-b290-4c6e-89a6-5381f9525d5d@github.com> References: <-lOGjSVxJWizo8DwigVYxZdDK6bciEapRZYwNSPigOg=.583bd3a0-b290-4c6e-89a6-5381f9525d5d@github.com> Message-ID: On Tue, 1 Nov 2022 17:08:35 GMT, Vladimir Kozlov wrote: >> This bug is similar to [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600): A CheckCastPP with a raw oop input floats out of a loop and below a safepoint. Since C2 does not generate OopMap entries for raw pointers, the GC will not update the oop if the corresponding object is moved during the safepoint. We either assert already during OopMap creation, or crash when dereferencing a stale oop during runtime (the verification code does not always detect such live raw oops at safepoints, I included a fix for that as well). >> >> I think the fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) is incomplete, because it only bails out of [PhaseIdealLoop::try_sink_out_of_loop](https://github.com/openjdk/jdk/commit/2ff4c01d42f1afcc53abd48e074356fb4a700754) while the underlying issue is that a raw CheckCastPP ends up with ctrl "far away" from its Allocate/Initialize and potentially even below a safepoint. Usually, the CheckCastPP would always be part of safepoint debug info and therefore late ctrl would be guaranteed to be above the safepoint. However, vector objects are aggressively scalar replaced in safepoints, which allows late ctrl to be set to further below. This is specific to vectors, since "normal" Java objects would either be fully scalarized or not be scalarized at all. >> >> In the failing case, Loop Unswitching clones the loop body and creates a Phi to merge the oop results from the vector allocations in both loops. Since ctrl of the CheckCastPP is outside of the loop, its data input is changed to the newly created Phi and its control input is set to the region that merges the loop exits. This moves the CheckCastPP below a safepoint in the loop. >> >> Below graphs show the details. `395 CheckCastPP` is removed from the debug info for `262 CallStaticJava` because it's scalarized (`326 SafepointScalarObject`). Late ctrl is then computed to be outside of the loop because the CheckCastPP is only used in the return. >> >> ![8276064_Before](https://user-images.githubusercontent.com/5312595/199204249-17564a59-2b67-4426-be71-19bc0eafac99.png) >> >> Now Loop Unswitching creates a `487 Region` and `517 Phi` to merge control and data inputs to the CheckCastPP from the fast and slow loops (see `PhaseIdealLoop::clone_loop_handle_data_uses`). 
Control of the `395 CheckCastPP` is updated accordingly. >> >> ![8276064_After](https://user-images.githubusercontent.com/5312595/199204273-44341cd7-b5b6-4ec0-b8c9-6f349393dbd1.png) >> >> As a result, the raw oop input of the `395 CheckCastPP` is live at `262 CallStaticJava`. >> >> We could now add another point fix to prevent loop unswitching from moving the CheckCastPP out of the loop, but I think there is a risk that other current or future optimizations would rely on the CheckCastPP's late ctrl and do a similar thing. I would therefore suggest to pin all CheckCastPPs with a raw oop input, similar to what [JDK-5071820](https://bugs.openjdk.org/browse/JDK-5071820) did in GCM: >> https://github.com/openjdk/jdk/blob/37107fc1574a4191987420d88f7182e63c7da60c/src/hotspot/share/opto/gcm.cpp#L1325-L1330 >> >> The fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) can then be reverted, also because Roland's fix for [JDK-8272562](https://bugs.openjdk.org/browse/JDK-8272562) disabled moving **all** CheckCastPPs out of loops anyway. Roland said that he plans to revisit that decision with [JDK-8275202](https://bugs.openjdk.org/browse/JDK-8275202). The tests added with this PR will cover the `PhaseIdealLoop::try_sink_out_of_loop` case as well and therefore serve as regression tests for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600). >> >> We could improve this by adding logic to set late ctrl just above the safepoint, but I'm not sure if it's worth the complexity because we would need to walk up the control paths from late to early control and compute the dominator of all safepoints. >> >> I also fixed the verification code in `OopFlow::build_oop_map` to account for spilling. Before, compilation of `test1` would pass and only crash during execution. Now, we assert and print: >> >> >> 454 DefinitionSpillCopy === _ 122 [[ 321 ]] !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) >> 321 Phi === 315 454 503 [[ 512 ]] #rawptr:NotNull !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) >> 38 CallStaticJavaDirect === 40 126 136 102 0 455 452 138 139 453 460 [[ 39 84 130 37 388 ]] Static compiler.vectorapi.TestRawOopAtSafepoint::safepoint # void ( int ) TestRawOopAtSafepoint::test1 @ bci:44 (line 75) !jvms: TestRawOopAtSafepoint::test1 @ bci:44 (line 75) >> >> >> >> What do you think? >> >> Thanks, >> Tobias > > Good and conservative solution. > > Thanks for discussion offline with Tobias and Vladimir I. about this special case from Vector API. Thanks for the reviews, @vnkozlov and @iwanowww! I'll run some performance testing, just to make sure. ------------- PR: https://git.openjdk.org/jdk/pull/10932 From thartmann at openjdk.org Wed Nov 2 06:59:27 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Nov 2022 06:59:27 GMT Subject: RFR: JDK-8290063: IGV: Give the graphs a unique number in the outline In-Reply-To: References: Message-ID: On Wed, 26 Oct 2022 14:02:11 GMT, Tobias Holenstein wrote: > Some graphs may have the same name more than once in IGV. To make it clearer which graph is currently open, all graphs within a group are now enumerated with `1.`, `2.`, `3.` , etc. Similarly, groups are enumerated with `1 -`, `2 -`, etc. > > overview > > The make it even further easier to distinguish different graphs and group, we can now rename them. > To rename them: > 1. click on a graph or group once to select it (does not have to be opened). > 2. 
click a second time on the selected graph and wait 1-2 seconds. > 3. now you can rename the graph > rename_group > rename_graph > > The numbering always starts at 1 and is continuous from 1 to N for N graphs. When a graph is deleted, the numbering of the following graphs changes. This implementation allows the keep the XML format unchanged, because the numbering is only local and not part of the name. However, if a graph/group is renamed, the name in the XML file will also change when it is saved. > > # Implementierung > The renaming is simply enabled by overriding `canRename() {return true;}` in `FolderNode` and `GraphNode` > > The numbering is implemented in `getDisplayName()` in `Group` and `InputGraph` by concatenating the index of the group/graph with the name. > > When a group/graph is deleted the we trigger an update of the index in `FolderNode` -> `destroyNodes(Node[] nodes)` by calling `node.setDisplayName(node.getDisplayName())` for all nodes. > > Refresh the group/graph name in the `EditorTopComponent` of the opened graph is a bit tricky. It is implemented by adding a `Listener` to the `getDisplayNameChangedEvent()` of the currently opened `InputGraph` in `DiagramViewModel`. `getDisplayNameChangedEvent()` is fired whenever the name or the group name of the corresponding `InputGraph` is changed. It works well for me but I'm wondering if "click and wait for 1-2 seconds" (which is more like 3 seconds on my system) should be replaced by an entry in the right-click menu. For fast renaming, a shortcut (F2) can be used, which already works. I spotted the following (existing) issue: - Open Phase 5 - Delete Phase 2 - The graph changes. Phase 6 is now displayed while the selection still shows Phase 5. I don't think deleting a non-opened phase should affect the currently opened one, right? ------------- PR: https://git.openjdk.org/jdk/pull/10873 From fyang at openjdk.org Wed Nov 2 08:01:13 2022 From: fyang at openjdk.org (Fei Yang) Date: Wed, 2 Nov 2022 08:01:13 GMT Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0. [v3] In-Reply-To: <6TA2jQl7mrwAh7K5aTG0cUIn4e1lOq0O-XnxK4DIXb4=.ad37b83d-e53d-4ffd-8eb9-0926d3d69dd7@github.com> References: <6TA2jQl7mrwAh7K5aTG0cUIn4e1lOq0O-XnxK4DIXb4=.ad37b83d-e53d-4ffd-8eb9-0926d3d69dd7@github.com> Message-ID: On Thu, 27 Oct 2022 08:39:53 GMT, Dingli Zhang wrote: >> Hi, >> >> Some instructions previously had old assembler notation, but were renamed in RVV1.0 to be consistent with scalar instructions, such as `vpopc_m->vcpop_m`[1] , `vfredsum_vs->vfredusum_vs`[2], `vmornot_mm->vmorn_mm/vmandnot_mm->vmandn_mm`[3], `vle1_v->vlm_v/vse1_v->vsm_v`[4]. We'd better keep the name the same as the new assembler mnemonics. >> >> The instruction `vl1r.v` in rvv0.9[5] is the same logic as `vl1re8.v` in rvv1.0[6], while in rvv1.0 it becomes a pseudoinstruction which is equal to `vl1re8.v`. >> >> By the way, we can find all the rvv aliases here[7]. >> >> Please take a look and have some reviews. Thanks a lot. 
>> >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#152-vector-count-population-in-mask-vcpopm >> [2] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#143-vector-single-width-floating-point-reduction-instructions >> [3] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#151-vector-mask-register-logical-instructions >> [4] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#74-vector-unit-stride-instructions >> [5] https://github.com/riscv/riscv-v-spec/blob/0.9/v-spec.adoc#79-vector-loadstore-whole-register-instructions >> [6] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#79-vector-loadstore-whole-register-instructions >> [7] https://github.com/riscv/riscv-opcodes/blob/master/rv_v_aliases >> >> ## Testing: >> >> - hotspot and jdk tier1 on unmatched board without new failures > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Fix alignment Looks good. I think it should be easy to add those instruction aliases in macro-assembler when they are really needed some day. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/10878 From thartmann at openjdk.org Wed Nov 2 08:13:31 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Nov 2022 08:13:31 GMT Subject: RFR: 8294217: Assertion failure: parsing found no loops but there are some In-Reply-To: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> References: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> Message-ID: On Fri, 28 Oct 2022 14:34:42 GMT, Roland Westrelin wrote: > This was reported on 11 and is not reproducible with the current > jdk. The reason is that the PhaseIdealLoop invocation before EA was > changed from LoopOptsNone to LoopOptsMaxUnroll. In the absence of > loops, LoopOptsMaxUnroll exits earlier than LoopOptsNone. That wasn't > intended and this patch makes sure they behave the same. Once that's > changed, the crash reproduces with the current jdk. > > The assert fires because PhaseIdealLoop::only_has_infinite_loops() > returns false even though the IR only has infinite loops. There's a > single loop nest and the inner most loop is an infinite loop. The > current logic only looks at loops that are direct children of the root > of the loop tree. It's not the first bug where > PhaseIdealLoop::only_has_infinite_loops() fails to catch an infinite > loop (8257574 was the previous one) and it's proving challenging to > have PhaseIdealLoop::only_has_infinite_loops() handle corner cases > robustly. I reworked PhaseIdealLoop::only_has_infinite_loops() once > more. This time it goes over all children of the root of the loop > tree, collects all controls for the loop and its inner loop. It then > checks whether any control is a branch out of the loop and if it is > whether it's not a NeverBranch. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10904 From thartmann at openjdk.org Wed Nov 2 08:13:31 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Nov 2022 08:13:31 GMT Subject: RFR: 8294217: Assertion failure: parsing found no loops but there are some In-Reply-To: References: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> Message-ID: On Mon, 31 Oct 2022 07:39:54 GMT, Christian Hagedorn wrote: >> This was reported on 11 and is not reproducible with the current >> jdk. The reason is that the PhaseIdealLoop invocation before EA was >> changed from LoopOptsNone to LoopOptsMaxUnroll. In the absence of >> loops, LoopOptsMaxUnroll exits earlier than LoopOptsNone. That wasn't >> intended and this patch makes sure they behave the same. Once that's >> changed, the crash reproduces with the current jdk. >> >> The assert fires because PhaseIdealLoop::only_has_infinite_loops() >> returns false even though the IR only has infinite loops. There's a >> single loop nest and the inner most loop is an infinite loop. The >> current logic only looks at loops that are direct children of the root >> of the loop tree. It's not the first bug where >> PhaseIdealLoop::only_has_infinite_loops() fails to catch an infinite >> loop (8257574 was the previous one) and it's proving challenging to >> have PhaseIdealLoop::only_has_infinite_loops() handle corner cases >> robustly. I reworked PhaseIdealLoop::only_has_infinite_loops() once >> more. This time it goes over all children of the root of the loop >> tree, collects all controls for the loop and its inner loop. It then >> checks whether any control is a branch out of the loop and if it is >> whether it's not a NeverBranch. > > src/hotspot/share/opto/loopnode.cpp line 4183: > >> 4181: >> 4182: #ifdef ASSERT >> 4183: bool PhaseIdealLoop::only_has_infinite_loops() { > >> This time it goes over all children of the root of the loop > tree, collects all controls for the loop and its inner loop. It then > checks whether any control is a branch out of the loop and if it is > whether it's not a NeverBranch. > > Maybe you can add this summary as a comment here. +1 ------------- PR: https://git.openjdk.org/jdk/pull/10904 From mdoerr at openjdk.org Wed Nov 2 09:19:21 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Nov 2022 09:19:21 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 18:12:54 GMT, Vladimir Kozlov wrote: > Is it possible to test this case possibly with small CodeCache sizes? We don't have any deterministic reproduction case. The problem only happens sometimes on some machines with a proprietary test, not always. For reproduction, both code cache segments need to be completely full and the GC needs to be slow enough at the point of time at which a new method handle intrinsic is required. I only verified that the issue is gone with this change in our nightly tests. 
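As a follow-up thought on the small-code-cache question: below is a rough, hypothetical stress sketch (class and method names are invented here, and whether this actually drives the failing allocation path is an assumption, not a verified reproducer). The idea is to link and invoke method handles with many distinct erased signatures, so that new per-signature method handle intrinsics keep being requested, and to run it with a deliberately small -XX:ReservedCodeCacheSize.

    // Hypothetical stress sketch, not a known reproducer: churn through many
    // distinct MethodHandle signatures under a small code cache.
    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    public class MhIntrinsicChurn {
        static int sink(int a) { return a; }

        static Object zeroOf(Class<?> t) {
            if (t == int.class)    return 0;
            if (t == long.class)   return 0L;
            if (t == float.class)  return 0.0f;
            if (t == double.class) return 0.0d;
            return null;
        }

        public static void main(String[] args) throws Throwable {
            MethodHandle base = MethodHandles.lookup().findStatic(
                    MhIntrinsicChurn.class, "sink",
                    MethodType.methodType(int.class, int.class));
            Class<?>[] kinds = {int.class, long.class, float.class, double.class, Object.class};
            // 5^3 = 125 distinct leading-argument combinations, intended to make
            // the runtime set up many differently shaped invoker/linker forms.
            for (Class<?> a : kinds) {
                for (Class<?> b : kinds) {
                    for (Class<?> c : kinds) {
                        MethodHandle h = MethodHandles.dropArguments(base, 0, a, b, c);
                        h.invokeWithArguments(zeroOf(a), zeroOf(b), zeroOf(c), 42);
                    }
                }
            }
            System.out.println("done");
        }
    }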
------------- PR: https://git.openjdk.org/jdk/pull/10933 From haosun at openjdk.org Wed Nov 2 09:24:29 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 2 Nov 2022 09:24:29 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: References: Message-ID: <1K3GDuQ_B_jYVDTYPvYcbhbUeJM1RihgIiHnbvlFYyQ=.fa471a1d-49bb-4ef3-87d0-b70ee82334ac@github.com> On Wed, 2 Nov 2022 03:06:21 GMT, Dong Bo wrote: > In JDK-8252204, when implemented SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`. > However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`: > > digest_length block_size > SHA3-224 28 144 > SHA3-256 32 136 > SHA3-384 48 104 > SHA3-512 64 72 > SHAKE128 variable 168 > SHAKE256 variable 136 > > > This causes SIGSEGV crash or hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`. > > The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances. > Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed. > And tier1~3 passed on SHA3 supported hardware. > > The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform. > The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively. > > Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform. > Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression. > These performance details are available in the comments of the issue page. > I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform. Overall, it looks good to me. (I'm not a Reviewer) I agree with your fix to pass `block_size` rather than `digest_length`, but I didn't fully understand your update insides `stubGenerator_aarch64.cpp`, i.e. those operations before `rounds24_loop`, since I'm not a crypto expert. We may need some crypo expert to review that part. Note that I rerun tier1~3 with the latest commit in this PR on sha512 feature supported hardware, and the test passed. Thanks. ------------- Marked as reviewed by haosun (Author). PR: https://git.openjdk.org/jdk/pull/10939 From tholenstein at openjdk.org Wed Nov 2 09:26:44 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 2 Nov 2022 09:26:44 GMT Subject: RFR: JDK-8290063: IGV: Give the graphs a unique number in the outline [v2] In-Reply-To: References: Message-ID: > Some graphs may have the same name more than once in IGV. To make it clearer which graph is currently open, all graphs within a group are now enumerated with `1.`, `2.`, `3.` , etc. Similarly, groups are enumerated with `1 -`, `2 -`, etc. > > overview > > The make it even further easier to distinguish different graphs and group, we can now rename them. > To rename them: > 1. click on a graph or group once to select it (does not have to be opened). > 2. click a second time on the selected graph and wait 1-2 seconds. > 3. 
now you can rename the graph
> rename_group
> rename_graph
>
> The numbering always starts at 1 and is continuous from 1 to N for N graphs. When a graph is deleted, the numbering of the following graphs changes. This implementation allows keeping the XML format unchanged, because the numbering is only local and not part of the name. However, if a graph/group is renamed, the name in the XML file will also change when it is saved.
>
> # Implementation
> The renaming is simply enabled by overriding `canRename() {return true;}` in `FolderNode` and `GraphNode`.
>
> The numbering is implemented in `getDisplayName()` in `Group` and `InputGraph` by concatenating the index of the group/graph with the name.
>
> When a group/graph is deleted, we trigger an update of the index in `FolderNode` -> `destroyNodes(Node[] nodes)` by calling `node.setDisplayName(node.getDisplayName())` for all nodes.
>
> Refreshing the group/graph name in the `EditorTopComponent` of the opened graph is a bit tricky. It is implemented by adding a `Listener` to the `getDisplayNameChangedEvent()` of the currently opened `InputGraph` in `DiagramViewModel`. `getDisplayNameChangedEvent()` is fired whenever the name or the group name of the corresponding `InputGraph` is changed.

Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision:

 - missing imports
 - added RenameAction with shortcut

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/10873/files
  - new: https://git.openjdk.org/jdk/pull/10873/files/d7fb0854..4f36c623

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=10873&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10873&range=00-01

Stats: 36 lines in 4 files changed: 28 ins; 0 del; 8 mod
Patch: https://git.openjdk.org/jdk/pull/10873.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10873/head:pull/10873

PR: https://git.openjdk.org/jdk/pull/10873

From tholenstein at openjdk.org Wed Nov 2 09:52:28 2022
From: tholenstein at openjdk.org (Tobias Holenstein)
Date: Wed, 2 Nov 2022 09:52:28 GMT
Subject: RFR: JDK-8290063: IGV: Give the graphs a unique number in the outline
In-Reply-To: 
References: 
Message-ID: 

On Wed, 2 Nov 2022 06:56:59 GMT, Tobias Hartmann wrote:

> It works well for me but I'm wondering if "click and wait for 1-2 seconds" (which is more like 3 seconds on my system) should be replaced by an entry in the right-click menu. For fast renaming, a shortcut (F2) can be used, which already works.
>
> I spotted the following (existing) issue:
>
> * Open Phase 5
> * Delete Phase 2
> * The graph changes. Phase 6 is now displayed while the selection still shows Phase 5.
>
> I don't think deleting a non-opened phase should affect the currently opened one, right?

Thanks for testing, @TobiHartmann! I agree that a right-click menu is more convenient. I added this functionality and updated the PR.
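For readers without the IGV sources at hand, the `canRename()`/`getDisplayName()` hooks described in the PR text above roughly take the following shape on a NetBeans node. This is a hypothetical, self-contained sketch, not the actual `GraphNode`/`FolderNode` code; the fields and constructor are stand-ins for IGV's `InputGraph` data model.

    // Hypothetical sketch of the renaming/numbering hooks; not the actual IGV source.
    import org.openide.nodes.AbstractNode;
    import org.openide.nodes.Children;

    public class GraphNodeSketch extends AbstractNode {
        private String graphName;          // stand-in for the InputGraph name
        private final int indexInGroup;    // 1-based position within the group

        public GraphNodeSketch(String graphName, int indexInGroup) {
            super(Children.LEAF);
            this.graphName = graphName;
            this.indexInGroup = indexInGroup;
        }

        @Override
        public boolean canRename() {
            return true;                   // enables in-place renaming (e.g. F2)
        }

        @Override
        public void setName(String name) {
            graphName = name;              // IGV would propagate this to the data model
            fireDisplayNameChange(null, null);
        }

        @Override
        public String getDisplayName() {
            // Prepend the local index, e.g. "3. After Parsing".
            return indexInGroup + ". " + graphName;
        }
    }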
The issue with the deleting is known and a bug report already exists: https://bugs.openjdk.org/browse/JDK-8294066

-------------

PR: https://git.openjdk.org/jdk/pull/10873

From eosterlund at openjdk.org Wed Nov 2 10:02:37 2022
From: eosterlund at openjdk.org (Erik Österlund)
Date: Wed, 2 Nov 2022 10:02:37 GMT
Subject: RFR: 8296101: nmethod::is_unloading result unstable with concurrent unloading [v2]
In-Reply-To: 
References: 
Message-ID: 

On Tue, 1 Nov 2022 10:41:17 GMT, Kim Barrett wrote:

>> Erik Österlund has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Avoid recursion
>
> Looks good as-is, but with a suggested alternate you can take or not.

Thank you for the reviews, @kimbarrett and @dean-long! I updated the PR with a variation without recursion, as requested.

-------------

PR: https://git.openjdk.org/jdk/pull/10926

From eosterlund at openjdk.org Wed Nov 2 10:02:37 2022
From: eosterlund at openjdk.org (Erik Österlund)
Date: Wed, 2 Nov 2022 10:02:37 GMT
Subject: RFR: 8296101: nmethod::is_unloading result unstable with concurrent unloading [v2]
In-Reply-To: 
References: 
Message-ID: 

> If an nmethod is not called during a concurrent full GC, then after marking has terminated, multiple threads can call is_unloading. If at the same time, the nmethod is made not_entrant, then we run into a source of instability in the is_cold() calculation used when computing is_unloading. There we check if the nmethod is_not_entrant(), which some concurrent observers will think is true, while others think it's false.
> The current code that sets the is_unloading_state in is_unloading() assumes that the computed state is the same across all observers. However, that is no longer true.
>
> I propose to set the is_unloading_state with a CAS instead of plain store. Then, as is_unloading() is computed before making nmethods not_entrant, we can guarantee that all concurrent readers of is_unloading in this scenario will return false in the current unloading cycle, instead of racingly returning either false or true. One thread wins, and it will say false, and the other threads will compute conflicting results, but end up agreeing after the CAS that they should all return false.
>
> Tested with mach5 tier1-7. Also tried replacing the is_not_entrant() ingredient in is_cold with os::random() to simulate the instability source. Without my fix RunThese crashes almost immediately, and with my fix it doesn't crash.

Erik Österlund has updated the pull request incrementally with one additional commit since the last revision:

  Avoid recursion

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/10926/files
  - new: https://git.openjdk.org/jdk/pull/10926/files/a36328ab..dc4b463a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=10926&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10926&range=00-01

Stats: 12 lines in 1 file changed: 5 ins; 5 del; 2 mod
Patch: https://git.openjdk.org/jdk/pull/10926.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10926/head:pull/10926

PR: https://git.openjdk.org/jdk/pull/10926

From thartmann at openjdk.org Wed Nov 2 10:19:07 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Wed, 2 Nov 2022 10:19:07 GMT
Subject: RFR: JDK-8290063: IGV: Give the graphs a unique number in the outline [v2]
In-Reply-To: 
References: 
Message-ID: 

On Wed, 2 Nov 2022 09:26:44 GMT, Tobias Holenstein wrote:

>> Some graphs may have the same name more than once in IGV.
To make it clearer which graph is currently open, all graphs within a group are now enumerated with `1.`, `2.`, `3.` , etc. Similarly, groups are enumerated with `1 -`, `2 -`, etc. >> >> overview >> >> The make it even further easier to distinguish different graphs and group, we can now rename them in two ways. >> **(A)** To rename a group or graph: >> 1. click on a graph or group once to select it (does not have to be opened). >> 2. click a second time on the selected graph and wait 1-2 seconds. >> 3. now you can rename the graph >> 4. on Linux F2 works as a shortcut for renaming >> rename_group >> rename_graph >> >> **(B)** To rename a group or graph: >> 1. right-click on a graph or group once to select it (does not have to be opened). >> 2. select "Rename.." >> 3. now a window opens to rename the graph >> 4. The shortcut for this `RenameAction` is `ALT-CTRL-R` or `OPT-CMD-R` on a mac keyboard >> RenameAction >> Rename >> >> The numbering always starts at 1 and is continuous from 1 to N for N graphs. When a graph is deleted, the numbering of the following graphs changes. This implementation allows the keep the XML format unchanged, because the numbering is only local and not part of the name. However, if a graph/group is renamed, the name in the XML file will also change when it is saved. >> >> # Implementierung >> The renaming is simply enabled by overriding `canRename() {return true;}` in `FolderNode` and `GraphNode` >> >> The numbering is implemented in `getDisplayName()` in `Group` and `InputGraph` by concatenating the index of the group/graph with the name. >> >> When a group/graph is deleted the we trigger an update of the index in `FolderNode` -> `destroyNodes(Node[] nodes)` by calling `node.setDisplayName(node.getDisplayName())` for all nodes. >> >> Refresh the group/graph name in the `EditorTopComponent` of the opened graph is a bit tricky. It is implemented by adding a `Listener` to the `getDisplayNameChangedEvent()` of the currently opened `InputGraph` in `DiagramViewModel`. `getDisplayNameChangedEvent()` is fired whenever the name or the group name of the corresponding `InputGraph` is changed. > > Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: > > - missing imports > - added RenameAction with shortcut Thanks for updating. Looks good to me! ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10873 From kbarrett at openjdk.org Wed Nov 2 11:14:20 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Nov 2022 11:14:20 GMT Subject: RFR: 8296101: nmethod::is_unloading result unstable with concurrent unloading [v2] In-Reply-To: References: Message-ID: <6-EquT0B-FHlnfP9Gdl78p6otzThgDQpSHJIjgxv2_Q=.8ec85fcd-71a8-4183-bf9f-3cb71b8e3351@github.com> On Wed, 2 Nov 2022 10:02:37 GMT, Erik ?sterlund wrote: >> If an nmethod is not called during a concurrent full GC, then after marking has terminated, multiple threads can call is_unloading. If at the same time, the nmethod is made not_entrant, then we run into a source of instability in the is_cold() calculation used when computing is_unloading. There we check if the nmethod is_not_entrant(), which some concurrent observers will think is true, while others think it's false. >> The current code that sets the is_unloading_state in is_unloading() assumes that the computed state is the same across all observers. However, that is no longer true. >> >> I propose to set the is_unloading_state with a CAS instead of plain store. 
Then, as is_unloading() is computed before making nmethods not_entrant, we can guarantee that all concurrent readers of is_unloading in this scenario will return false in the current unloading cycle, instead of racingly returning either false or true. One thread wins, and it will say false, and the other threads will compute conflicting results, but end up agreeing after the CAS, that they should all return false. >> >> Tested with mach5 tier1-7. Also tried replacing the is_not_entrant() ingredient in is_cold with os::random() to simulate the instability source. Without my fix RunThese crashes almost immediately, and with my fix it doesn't crash. > > Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: > > Avoid recursion Recursion removal looks good. ------------- Marked as reviewed by kbarrett (Reviewer). PR: https://git.openjdk.org/jdk/pull/10926 From rkennke at openjdk.org Wed Nov 2 11:32:21 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 2 Nov 2022 11:32:21 GMT Subject: RFR: 8296136: Use correct register in aarch64_enc_fast_unlock() [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 12:16:48 GMT, Roman Kennke wrote: >> In aarch64_enc_fast_unlock() (aarch64.ad) we have this piece of code: >> >> >> __ ldr(tmp, Address(oop, oopDesc::mark_offset_in_bytes())); >> __ tbnz(disp_hdr, exact_log2(markWord::monitor_value), object_has_monitor); >> >> >> The tbnz uses the wrong register - it should really use tmp. disp_hdr has been loaded with the displaced header of the stack-lock, which would never have its monitor bits set, thus the branch will always take the slow path. In this common case, it is only a performance nuisance. In the case of !UseHeavyMonitors it is even worse, then disp_hdr will be unitialized, and we are facing a correctness problem. >> >> As far as I can tell, the problem dates back to when aarch64 C2 parts have been added to OpenJDK. >> >> Testing: >> - [x] tier1 >> - [x] tier2 >> - [x] tier3 >> - [x] tier4 > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > Same fix for RISC-V Thank you! ------------- PR: https://git.openjdk.org/jdk/pull/10921 From rkennke at openjdk.org Wed Nov 2 11:36:31 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 2 Nov 2022 11:36:31 GMT Subject: Integrated: 8296136: Use correct register in aarch64_enc_fast_unlock() In-Reply-To: References: Message-ID: On Mon, 31 Oct 2022 17:31:31 GMT, Roman Kennke wrote: > In aarch64_enc_fast_unlock() (aarch64.ad) we have this piece of code: > > > __ ldr(tmp, Address(oop, oopDesc::mark_offset_in_bytes())); > __ tbnz(disp_hdr, exact_log2(markWord::monitor_value), object_has_monitor); > > > The tbnz uses the wrong register - it should really use tmp. disp_hdr has been loaded with the displaced header of the stack-lock, which would never have its monitor bits set, thus the branch will always take the slow path. In this common case, it is only a performance nuisance. In the case of !UseHeavyMonitors it is even worse, then disp_hdr will be unitialized, and we are facing a correctness problem. > > As far as I can tell, the problem dates back to when aarch64 C2 parts have been added to OpenJDK. > > Testing: > - [x] tier1 > - [x] tier2 > - [x] tier3 > - [x] tier4 This pull request has now been integrated. 
Changeset: 7619602c Author: Roman Kennke URL: https://git.openjdk.org/jdk/commit/7619602c365acee73a490f05be2bd0d3dd09d7a4 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod 8296136: Use correct register in aarch64_enc_fast_unlock() Reviewed-by: aph, fyang ------------- PR: https://git.openjdk.org/jdk/pull/10921 From rkennke at openjdk.org Wed Nov 2 11:41:59 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 2 Nov 2022 11:41:59 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() Message-ID: The code in C2_MacroAssembler::fast_unlock() has several (minor) issues: - The stack-locking path for x86_32 is not under UseHeavyMonitors - it would be executed even when stack-locking is disabled. - The stack-locking paths are the same for x86_32 and x86_64 - they can be merged into a common path. - In x86_32 path, we call get_thread(boxReg) which is totally bogus because we clear boxReg right afterwards with xorptr(boxReg, boxReg). - In x86_32 path, the CheckSucc label is identical to the DONE label, and in-fact CheckSucc is only ever really used in the x86_64 path and can be moved there. Testing: - [x] tier1 (x86_32, x86_64) - [x] tier2 (x86_32, x86_64) ------------- Commit messages: - 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() Changes: https://git.openjdk.org/jdk/pull/10936/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10936&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296170 Stats: 32 lines in 1 file changed: 8 ins; 20 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/10936.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10936/head:pull/10936 PR: https://git.openjdk.org/jdk/pull/10936 From bkilambi at openjdk.org Wed Nov 2 12:13:21 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Nov 2022 12:13:21 GMT Subject: RFR: 8294816: C2: Math.min/max vectorization miscompilation Message-ID: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> C2 miscompiles during auto-vectorization of MinI/MaxI nodes when "short" type operands are involved. When a short and an integer value is compared, C2 generates vector min/max nodes for "short" types which does not result in correct output as it disregards the higher order bits of the integer input. Java API for Math.min/max also only supports the int, long, float and double types but not the subword integer types namely - char, byte and short. Hence, char/short/byte min/max vector instructions should not be generated. This patch ensures that MaxV and MinV vector nodes are only generated for the "int" type for MaxI and MinI nodes during auto-vectorization. 
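For context, a minimal sketch of the kind of loop this is about (an assumed shape, not the PR's actual `TestMinMaxSubword.java` regression test): the short operand is widened to int before `Math.min`, so with a constant that does not fit into 16 bits the result must be computed in 32-bit arithmetic; a short-lane vector min would truncate the constant first and produce the wrong value.

    // Illustrative sketch (assumed shape, not the PR's regression test).
    // Math.min(5, 65537) must be 5; a 16-bit vector min would first truncate
    // 65537 to (short) 1 and wrongly yield 1.
    public class MinMaxSubwordSketch {
        static final int BIG = 65537;      // does not fit into a short

        static void minIntoShorts(short[] dst, short[] src) {
            for (int i = 0; i < dst.length; i++) {
                dst[i] = (short) Math.min(src[i], BIG);   // src[i] is widened to int
            }
        }

        public static void main(String[] args) {
            short[] src = new short[1024];
            short[] dst = new short[1024];
            java.util.Arrays.fill(src, (short) 5);
            for (int iter = 0; iter < 20_000; iter++) {   // warm up so C2 compiles the loop
                minIntoShorts(dst, src);
            }
            System.out.println(dst[0]);                   // expected: 5
        }
    }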
------------- Commit messages: - 8294816: C2: Math.min/max vectorization miscompilation Changes: https://git.openjdk.org/jdk/pull/10944/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10944&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294816 Stats: 139 lines in 4 files changed: 139 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10944.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10944/head:pull/10944 PR: https://git.openjdk.org/jdk/pull/10944 From bkilambi at openjdk.org Wed Nov 2 12:14:39 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Nov 2022 12:14:39 GMT Subject: RFR: 8288107: Auto-vectorization for integer min/max [v2] In-Reply-To: <00xVL_xq6POBib-8x5oB0GPkE57-7hNtNL2rOCXCdPE=.d891ffa2-9409-4089-ac16-15fdca2964bb@github.com> References: <00xVL_xq6POBib-8x5oB0GPkE57-7hNtNL2rOCXCdPE=.d891ffa2-9409-4089-ac16-15fdca2964bb@github.com> Message-ID: On Mon, 24 Oct 2022 13:25:36 GMT, Tobias Hartmann wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> 8288107: Auto-vectorization for integer min/max >> >> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. >> A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : >> >> Before this patch: >> aarch64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ? 1.488 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ? 1.365 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ? 0.985 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ? 1.219 ns/op >> >> x86-64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ? 4.780 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ? 4.158 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ? 4.838 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ? 4.025 ns/op >> >> After this patch: >> aarch64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ? 0.206 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ? 0.231 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ? 0.234 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ? 0.295 ns/op >> >> x86-64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ? 
0.512 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ? 0.740 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ? 0.376 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ? 0.498 ns/op >> >> With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below : >> aarch64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ? 1.072 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ? 1.057 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ? 1.093 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ? 1.098 ns/op >> >> x86-64: >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ? 4.726 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ? 4.754 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ? 4.658 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ? 4.768 ns/op >> There is no degradation when vectorization is disabled. > > Thank you! @TobiHartmann Please review - https://github.com/openjdk/jdk/pull/10944. Thank you! ------------- PR: https://git.openjdk.org/jdk/pull/9466 From dzhang at openjdk.org Wed Nov 2 12:23:28 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 2 Nov 2022 12:23:28 GMT Subject: RFR: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0 [v3] In-Reply-To: References: Message-ID: On Thu, 27 Oct 2022 10:24:15 GMT, Zixian Cai wrote: >>> Not a reviewer. The changes match the spec. But perhaps it's a good idea to still keep the old names for compatibility per the spec. See line comments. >> >> Hi @caizixian, thanks for review! >> In a compiler (e.g. llvm) these alias need to be preserved because the assembly file only has instruction names, but does it also need to be preserved in a virtual machine like openjdk? >> If these older assembly mnemonics need to be retained as aliases, I think we can add it inside the macro assembler. > >> > Not a reviewer. The changes match the spec. But perhaps it's a good idea to still keep the old names for compatibility per the spec. See line comments. >> >> Hi @caizixian, thanks for review! In a compiler (e.g. llvm) these alias need to be preserved because the assembly file only has instruction names, but does it also need to be preserved in a virtual machine like openjdk? If these older assembly mnemonics need to be retained as aliases, I think we can add it inside the macro assembler. > > @DingliZhang Good question. I can think of some possible use cases. > > 1. Someone has a fork and has existing modifications that use the vector instructions. When they merge the upstream into their fork, even though there's no merge conflicts, the code won't compile. Though you can argue that if they have a fork and maintain non-trivial changes, they should probably stay up-to-date with upstream changes. > 2. When we interact with the rest of the ecosystem, for example, binutils via hsdis, old mnemonics might still be shown by other tools, so keeping the old name might help when someone does a text search of the OpenJDK code. > 3. 
Somewhat related to 2, when someone tries to port existing assembly snippets (either handwritten or disassembling object files produced by gcc or LLVM), it's easier if the old mnemonics still exist. > > Just to be clear, I don't have any strong opinion regarding this. But based on my recent experience with porting some GC assembly code to RISC-V, I thought it would be nice if we can make the assembler is bit more friendly to help people transition from older RISC-V specs to newer ones. @caizixian @yadongw @RealFYang Thanks for the review! ------------- PR: https://git.openjdk.org/jdk/pull/10878 From thartmann at openjdk.org Wed Nov 2 12:33:24 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Nov 2022 12:33:24 GMT Subject: RFR: 8294816: C2: Math.min/max vectorization miscompilation In-Reply-To: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> References: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> Message-ID: On Wed, 2 Nov 2022 12:06:14 GMT, Bhavana Kilambi wrote: > C2 miscompiles during auto-vectorization of MinI/MaxI nodes when "short" type operands are involved. When a short and an integer value is compared, C2 generates vector min/max nodes for "short" types which does not result in correct output as it disregards the higher order bits of the integer input. Java API for Math.min/max also only supports the int, long, float and double types but not the subword integer types namely - char, byte and short. Hence, char/short/byte min/max vector instructions should not be generated. > This patch ensures that MaxV and MinV vector nodes are only generated for the "int" type for MaxI and MinI nodes during auto-vectorization. Looks good to me. I'll run testing and report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10944 From dzhang at openjdk.org Wed Nov 2 12:37:06 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 2 Nov 2022 12:37:06 GMT Subject: Integrated: 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0 In-Reply-To: References: Message-ID: On Thu, 27 Oct 2022 02:23:13 GMT, Dingli Zhang wrote: > Hi, > > Some instructions previously had old assembler notation, but were renamed in RVV1.0 to be consistent with scalar instructions, such as `vpopc_m->vcpop_m`[1] , `vfredsum_vs->vfredusum_vs`[2], `vmornot_mm->vmorn_mm/vmandnot_mm->vmandn_mm`[3], `vle1_v->vlm_v/vse1_v->vsm_v`[4]. We'd better keep the name the same as the new assembler mnemonics. > > The instruction `vl1r.v` in rvv0.9[5] is the same logic as `vl1re8.v` in rvv1.0[6], while in rvv1.0 it becomes a pseudoinstruction which is equal to `vl1re8.v`. > > By the way, we can find all the rvv aliases here[7]. > > Please take a look and have some reviews. Thanks a lot. 
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#152-vector-count-population-in-mask-vcpopm > [2] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#143-vector-single-width-floating-point-reduction-instructions > [3] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#151-vector-mask-register-logical-instructions > [4] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#74-vector-unit-stride-instructions > [5] https://github.com/riscv/riscv-v-spec/blob/0.9/v-spec.adoc#79-vector-loadstore-whole-register-instructions > [6] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#79-vector-loadstore-whole-register-instructions > [7] https://github.com/riscv/riscv-opcodes/blob/master/rv_v_aliases > > ## Testing: > > - hotspot and jdk tier1 on unmatched board without new failures This pull request has now been integrated. Changeset: 2bd24c45 Author: Dingli Zhang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/2bd24c4542d6a28b8a7829f6db8f80fefd2bce5a Stats: 10 lines in 4 files changed: 0 ins; 0 del; 10 mod 8295968: RISC-V: Rename some assembler intrinsic functions for RVV 1.0 Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/10878 From redestad at openjdk.org Wed Nov 2 12:51:57 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 2 Nov 2022 12:51:57 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE Message-ID: We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported. This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. ------------- Commit messages: - Constrain AVX to require SSE 4 Changes: https://git.openjdk.org/jdk/pull/10946/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10946&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296168 Stats: 64 lines in 1 file changed: 34 ins; 27 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10946.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10946/head:pull/10946 PR: https://git.openjdk.org/jdk/pull/10946 From chagedorn at openjdk.org Wed Nov 2 13:21:11 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 2 Nov 2022 13:21:11 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 17:09:55 GMT, Dhamoder Nalla wrote: >> https://bugs.openjdk.org/browse/JDK-8286800 >> >> assert(real_LCA != NULL) in dump_real_LCA is not appropriate in bad graph scenario when both wrong_lca & early nodes are start nodes >> >> jvm!PhaseIdealLoop::dump_real_LCA(): >> // Walk the idom chain up from early and wrong_lca and stop when they intersect. >> while (!n1->is_Start() && !n2->is_Start()) { >> ... >> } >> assert(real_LCA != NULL, "must always find an LCA"); >> >> Fix: replace assert with a console message > >> > > Thanks @chhagedorn, please take this over as you are already familiar with this area. Hi @dhanalla, okay, I'll do that. Thanks for letting me know! 
------------- PR: https://git.openjdk.org/jdk/pull/10472 From thartmann at openjdk.org Wed Nov 2 13:41:20 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Nov 2022 13:41:20 GMT Subject: RFR: 8294816: C2: Math.min/max vectorization miscompilation In-Reply-To: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> References: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> Message-ID: <0-BPBR3D_Bf65Ad2-7t-p0Hmckc-VTi7jyXfeRFbK1g=.266806ea-8bf6-47c2-8391-e2648ee8e6b7@github.com> On Wed, 2 Nov 2022 12:06:14 GMT, Bhavana Kilambi wrote: > C2 miscompiles during auto-vectorization of MinI/MaxI nodes when "short" type operands are involved. When a short and an integer value is compared, C2 generates vector min/max nodes for "short" types which does not result in correct output as it disregards the higher order bits of the integer input. Java API for Math.min/max also only supports the int, long, float and double types but not the subword integer types namely - char, byte and short. Hence, char/short/byte min/max vector instructions should not be generated. > This patch ensures that MaxV and MinV vector nodes are only generated for the "int" type for MaxI and MinI nodes during auto-vectorization. Changes requested by thartmann (Reviewer). test/hotspot/jtreg/compiler/c2/TestMinMaxSubword.java line 64: > 62: // should not generate vectorized Min/Max nodes for them. > 63: @Test > 64: @IR(failOn = {IRNode.Min_V}) The test currently fails compilation because the IRNodes were renamed by https://github.com/openjdk/jdk/pull/10695 to ` MIN_V`. ------------- PR: https://git.openjdk.org/jdk/pull/10944 From bkilambi at openjdk.org Wed Nov 2 14:42:20 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Nov 2022 14:42:20 GMT Subject: RFR: 8294816: C2: Math.min/max vectorization miscompilation [v2] In-Reply-To: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> References: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> Message-ID: > C2 miscompiles during auto-vectorization of MinI/MaxI nodes when "short" type operands are involved. When a short and an integer value is compared, C2 generates vector min/max nodes for "short" types which does not result in correct output as it disregards the higher order bits of the integer input. Java API for Math.min/max also only supports the int, long, float and double types but not the subword integer types namely - char, byte and short. Hence, char/short/byte min/max vector instructions should not be generated. > This patch ensures that MaxV and MinV vector nodes are only generated for the "int" type for MaxI and MinI nodes during auto-vectorization. 
Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Updated string names for Min and Max IR nodes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10944/files - new: https://git.openjdk.org/jdk/pull/10944/files/064dbc1e..74d9886c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10944&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10944&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/10944.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10944/head:pull/10944 PR: https://git.openjdk.org/jdk/pull/10944 From bkilambi at openjdk.org Wed Nov 2 14:42:23 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Nov 2022 14:42:23 GMT Subject: RFR: 8294816: C2: Math.min/max vectorization miscompilation [v2] In-Reply-To: <0-BPBR3D_Bf65Ad2-7t-p0Hmckc-VTi7jyXfeRFbK1g=.266806ea-8bf6-47c2-8391-e2648ee8e6b7@github.com> References: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> <0-BPBR3D_Bf65Ad2-7t-p0Hmckc-VTi7jyXfeRFbK1g=.266806ea-8bf6-47c2-8391-e2648ee8e6b7@github.com> Message-ID: On Wed, 2 Nov 2022 13:37:46 GMT, Tobias Hartmann wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Updated string names for Min and Max IR nodes > > test/hotspot/jtreg/compiler/c2/TestMinMaxSubword.java line 64: > >> 62: // should not generate vectorized Min/Max nodes for them. >> 63: @Test >> 64: @IR(failOn = {IRNode.Min_V}) > > The test currently fails compilation because the IRNodes were renamed by https://github.com/openjdk/jdk/pull/10695 to ` MIN_V`. Thanks for letting me know. I have updated the code accordingly. Please review.. ------------- PR: https://git.openjdk.org/jdk/pull/10944 From shade at openjdk.org Wed Nov 2 15:14:36 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Nov 2022 15:14:36 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v2] In-Reply-To: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> References: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> Message-ID: On Fri, 28 Oct 2022 07:19:31 GMT, Jie Fu wrote: >> Hi all, >> >> As discussed here https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it would be better to add the vector api tests in GHA. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add jdk_vector_sanity test group > - Merge branch 'master' into JDK-8295970 > - Revert changes in test.yml > - 8295970: Add jdk_vector tests in GHA I think the usual style for these files is to maintain original order/sorting. test/jdk/TEST.groups line 44: > 42: :jdk_vector_sanity \ > 43: :jdk_svc_sanity \ > 44: :jdk_foreign \ Suggestion: :jdk_math \ :jdk_svc_sanity \ :jdk_foreign \ :jdk_vector_sanity \ test/jdk/TEST.groups line 80: > 78: :jdk_svc \ > 79: -:jdk_vector_sanity \ > 80: -:jdk_svc_sanity \ Suggestion: :jdk_svc \ -:jdk_svc_sanity \ -:jdk_vector_sanity \ test/jdk/TEST.groups line 367: > 365: > 366: jdk_vector_sanity = \ > 367: jdk/incubator/vector/AddTest.java \ These should probably be sorted too. 
------------- PR: https://git.openjdk.org/jdk/pull/10879 From chagedorn at openjdk.org Wed Nov 2 15:30:30 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 2 Nov 2022 15:30:30 GMT Subject: RFR: JDK-8290063: IGV: Give the graphs a unique number in the outline [v2] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 09:26:44 GMT, Tobias Holenstein wrote: >> Some graphs may have the same name more than once in IGV. To make it clearer which graph is currently open, all graphs within a group are now enumerated with `1.`, `2.`, `3.` , etc. Similarly, groups are enumerated with `1 -`, `2 -`, etc. >> >> overview >> >> The make it even further easier to distinguish different graphs and group, we can now rename them in two ways. >> **(A)** To rename a group or graph: >> 1. click on a graph or group once to select it (does not have to be opened). >> 2. click a second time on the selected graph and wait 1-2 seconds. >> 3. now you can rename the graph >> 4. on Linux F2 works as a shortcut for renaming >> rename_group >> rename_graph >> >> **(B)** To rename a group or graph: >> 1. right-click on a graph or group once to select it (does not have to be opened). >> 2. select "Rename.." >> 3. now a window opens to rename the graph >> 4. The shortcut for this `RenameAction` is `ALT-CTRL-R` or `OPT-CMD-R` on a mac keyboard >> RenameAction >> Rename >> >> The numbering always starts at 1 and is continuous from 1 to N for N graphs. When a graph is deleted, the numbering of the following graphs changes. This implementation allows the keep the XML format unchanged, because the numbering is only local and not part of the name. However, if a graph/group is renamed, the name in the XML file will also change when it is saved. >> >> # Implementierung >> The renaming is simply enabled by overriding `canRename() {return true;}` in `FolderNode` and `GraphNode` >> >> The numbering is implemented in `getDisplayName()` in `Group` and `InputGraph` by concatenating the index of the group/graph with the name. >> >> When a group/graph is deleted the we trigger an update of the index in `FolderNode` -> `destroyNodes(Node[] nodes)` by calling `node.setDisplayName(node.getDisplayName())` for all nodes. >> >> Refresh the group/graph name in the `EditorTopComponent` of the opened graph is a bit tricky. It is implemented by adding a `Listener` to the `getDisplayNameChangedEvent()` of the currently opened `InputGraph` in `DiagramViewModel`. `getDisplayNameChangedEvent()` is fired whenever the name or the group name of the corresponding `InputGraph` is changed. > > Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: > > - missing imports > - added RenameAction with shortcut Nice improvement! Works as expected - looks good to me. src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/FolderNode.java line 156: > 154: children.getFolder().setName(name); > 155: fireDisplayNameChange(null, null); > 156: } Missing new line src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/GraphNode.java line 65: > 63: graph.setName(name); > 64: fireDisplayNameChange(null, null); > 65: } Missing new line ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10873 From tholenstein at openjdk.org Wed Nov 2 15:37:34 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 2 Nov 2022 15:37:34 GMT Subject: RFR: JDK-8296235: IGV: Change shortcut to delete graph from ctrl+del to del Message-ID: Change the current shortcut to delete a graph from `ctrl`+`delete` to `delete` only (dropping ctrl). For keyboards without the `delete` key (like Mac) the alternative shortcut `ctrl`+`backspace` is unchanged ------------- Commit messages: - JDK-8296235: IGV: Change shortcut to delete graph from ctrl+del to del Changes: https://git.openjdk.org/jdk/pull/10950/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10950&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296235 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10950.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10950/head:pull/10950 PR: https://git.openjdk.org/jdk/pull/10950 From chagedorn at openjdk.org Wed Nov 2 15:37:34 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 2 Nov 2022 15:37:34 GMT Subject: RFR: JDK-8296235: IGV: Change shortcut to delete graph from ctrl+del to del In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 15:29:55 GMT, Tobias Holenstein wrote: > Change the current shortcut to delete a graph from `ctrl`+`delete` to `delete` only (dropping ctrl). > > For keyboards without the `delete` key (like Mac) the alternative shortcut `ctrl`+`backspace` is unchanged That's a nice small improvement, thanks for fixing that! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10950 From thartmann at openjdk.org Wed Nov 2 15:59:46 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Nov 2022 15:59:46 GMT Subject: RFR: JDK-8296235: IGV: Change shortcut to delete graph from ctrl+del to del In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 15:29:55 GMT, Tobias Holenstein wrote: > Change the current shortcut to delete a graph from `ctrl`+`delete` to `delete` only (dropping ctrl). > > For keyboards without the `delete` key (like Mac) the alternative shortcut `ctrl`+`backspace` is unchanged Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10950 From tholenstein at openjdk.org Wed Nov 2 15:59:47 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 2 Nov 2022 15:59:47 GMT Subject: RFR: JDK-8296235: IGV: Change shortcut to delete graph from ctrl+del to del In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 15:34:24 GMT, Christian Hagedorn wrote: >> Change the current shortcut to delete a graph from `ctrl`+`delete` to `delete` only (dropping ctrl). >> >> For keyboards without the `delete` key (like Mac) the alternative shortcut `ctrl`+`backspace` is unchanged > > That's a nice small improvement, thanks for fixing that! Thanks @chhagedorn and @TobiHartmann for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/10950 From tholenstein at openjdk.org Wed Nov 2 16:01:28 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 2 Nov 2022 16:01:28 GMT Subject: RFR: JDK-8290063: IGV: Give the graphs a unique number in the outline [v3] In-Reply-To: References: Message-ID: > Some graphs may have the same name more than once in IGV. To make it clearer which graph is currently open, all graphs within a group are now enumerated with `1.`, `2.`, `3.` , etc. Similarly, groups are enumerated with `1 -`, `2 -`, etc. 
> > overview > > The make it even further easier to distinguish different graphs and group, we can now rename them in two ways. > **(A)** To rename a group or graph: > 1. click on a graph or group once to select it (does not have to be opened). > 2. click a second time on the selected graph and wait 1-2 seconds. > 3. now you can rename the graph > 4. on Linux F2 works as a shortcut for renaming > rename_group > rename_graph > > **(B)** To rename a group or graph: > 1. right-click on a graph or group once to select it (does not have to be opened). > 2. select "Rename.." > 3. now a window opens to rename the graph > 4. The shortcut for this `RenameAction` is `ALT-CTRL-R` or `OPT-CMD-R` on a mac keyboard > RenameAction > Rename > > The numbering always starts at 1 and is continuous from 1 to N for N graphs. When a graph is deleted, the numbering of the following graphs changes. This implementation allows the keep the XML format unchanged, because the numbering is only local and not part of the name. However, if a graph/group is renamed, the name in the XML file will also change when it is saved. > > # Implementation > The renaming is simply enabled by overriding `canRename() {return true;}` in `FolderNode` and `GraphNode` > > The numbering is implemented in `getDisplayName()` in `Group` and `InputGraph` by concatenating the index of the group/graph with the name. > > When a group/graph is deleted the we trigger an update of the index in `FolderNode` -> `destroyNodes(Node[] nodes)` by calling `node.setDisplayName(node.getDisplayName())` for all nodes. > > Refresh the group/graph name in the `EditorTopComponent` of the opened graph is a bit tricky. It is implemented by adding a `Listener` to the `getDisplayNameChangedEvent()` of the currently opened `InputGraph` in `DiagramViewModel`. `getDisplayNameChangedEvent()` is fired whenever the name or the group name of the corresponding `InputGraph` is changed. Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: add missing newline ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10873/files - new: https://git.openjdk.org/jdk/pull/10873/files/4f36c623..9cdd1b7f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10873&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10873&range=01-02 Stats: 2 lines in 2 files changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10873.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10873/head:pull/10873 PR: https://git.openjdk.org/jdk/pull/10873 From tholenstein at openjdk.org Wed Nov 2 16:01:31 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 2 Nov 2022 16:01:31 GMT Subject: RFR: JDK-8290063: IGV: Give the graphs a unique number in the outline [v2] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 15:28:15 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: >> >> - missing imports >> - added RenameAction with shortcut > > Nice improvement! Works as expected - looks good to me. Thanks @chhagedorn and @TobiHartmann for the reviews! 
> src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/FolderNode.java line 156: > >> 154: children.getFolder().setName(name); >> 155: fireDisplayNameChange(null, null); >> 156: } > > Missing new line fixed > src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/GraphNode.java line 65: > >> 63: graph.setName(name); >> 64: fireDisplayNameChange(null, null); >> 65: } > > Missing new line fixed ------------- PR: https://git.openjdk.org/jdk/pull/10873 From tholenstein at openjdk.org Wed Nov 2 16:04:19 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 2 Nov 2022 16:04:19 GMT Subject: RFR: JDK-8290063: IGV: Give the graphs a unique number in the outline [v4] In-Reply-To: References: Message-ID: > Some graphs may have the same name more than once in IGV. To make it clearer which graph is currently open, all graphs within a group are now enumerated with `1.`, `2.`, `3.` , etc. Similarly, groups are enumerated with `1 -`, `2 -`, etc. > > overview > > The make it even further easier to distinguish different graphs and group, we can now rename them in two ways. > **(A)** To rename a group or graph: > 1. click on a graph or group once to select it (does not have to be opened). > 2. click a second time on the selected graph and wait 1-2 seconds. > 3. now you can rename the graph > 4. on Linux F2 works as a shortcut for renaming > rename_group > rename_graph > > **(B)** To rename a group or graph: > 1. right-click on a graph or group once to select it (does not have to be opened). > 2. select "Rename.." > 3. now a window opens to rename the graph > 4. The shortcut for this `RenameAction` is `ALT-CTRL-R` or `OPT-CMD-R` on a mac keyboard > RenameAction > Rename > > The numbering always starts at 1 and is continuous from 1 to N for N graphs. When a graph is deleted, the numbering of the following graphs changes. This implementation allows the keep the XML format unchanged, because the numbering is only local and not part of the name. However, if a graph/group is renamed, the name in the XML file will also change when it is saved. > > # Implementation > The renaming is simply enabled by overriding `canRename() {return true;}` in `FolderNode` and `GraphNode` > > The numbering is implemented in `getDisplayName()` in `Group` and `InputGraph` by concatenating the index of the group/graph with the name. > > When a group/graph is deleted the we trigger an update of the index in `FolderNode` -> `destroyNodes(Node[] nodes)` by calling `node.setDisplayName(node.getDisplayName())` for all nodes. > > Refresh the group/graph name in the `EditorTopComponent` of the opened graph is a bit tricky. It is implemented by adding a `Listener` to the `getDisplayNameChangedEvent()` of the currently opened `InputGraph` in `DiagramViewModel`. `getDisplayNameChangedEvent()` is fired whenever the name or the group name of the corresponding `InputGraph` is changed. 
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: remove trailing whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10873/files - new: https://git.openjdk.org/jdk/pull/10873/files/9cdd1b7f..e9c49122 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10873&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10873&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10873.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10873/head:pull/10873 PR: https://git.openjdk.org/jdk/pull/10873 From tholenstein at openjdk.org Wed Nov 2 16:08:38 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 2 Nov 2022 16:08:38 GMT Subject: Integrated: JDK-8290063: IGV: Give the graphs a unique number in the outline In-Reply-To: References: Message-ID: On Wed, 26 Oct 2022 14:02:11 GMT, Tobias Holenstein wrote: > Some graphs may have the same name more than once in IGV. To make it clearer which graph is currently open, all graphs within a group are now enumerated with `1.`, `2.`, `3.` , etc. Similarly, groups are enumerated with `1 -`, `2 -`, etc. > > overview > > The make it even further easier to distinguish different graphs and group, we can now rename them in two ways. > **(A)** To rename a group or graph: > 1. click on a graph or group once to select it (does not have to be opened). > 2. click a second time on the selected graph and wait 1-2 seconds. > 3. now you can rename the graph > 4. on Linux F2 works as a shortcut for renaming > rename_group > rename_graph > > **(B)** To rename a group or graph: > 1. right-click on a graph or group once to select it (does not have to be opened). > 2. select "Rename.." > 3. now a window opens to rename the graph > 4. The shortcut for this `RenameAction` is `ALT-CTRL-R` or `OPT-CMD-R` on a mac keyboard > RenameAction > Rename > > The numbering always starts at 1 and is continuous from 1 to N for N graphs. When a graph is deleted, the numbering of the following graphs changes. This implementation allows the keep the XML format unchanged, because the numbering is only local and not part of the name. However, if a graph/group is renamed, the name in the XML file will also change when it is saved. > > # Implementation > The renaming is simply enabled by overriding `canRename() {return true;}` in `FolderNode` and `GraphNode` > > The numbering is implemented in `getDisplayName()` in `Group` and `InputGraph` by concatenating the index of the group/graph with the name. > > When a group/graph is deleted the we trigger an update of the index in `FolderNode` -> `destroyNodes(Node[] nodes)` by calling `node.setDisplayName(node.getDisplayName())` for all nodes. > > Refresh the group/graph name in the `EditorTopComponent` of the opened graph is a bit tricky. It is implemented by adding a `Listener` to the `getDisplayNameChangedEvent()` of the currently opened `InputGraph` in `DiagramViewModel`. `getDisplayNameChangedEvent()` is fired whenever the name or the group name of the corresponding `InputGraph` is changed. This pull request has now been integrated. 
Changeset: a1c349f8 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/a1c349f8b3382bfffc3621b0d525c345322db920 Stats: 242 lines in 11 files changed: 173 ins; 28 del; 41 mod 8290063: IGV: Give the graphs a unique number in the outline Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10873 From tholenstein at openjdk.org Wed Nov 2 16:04:27 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 2 Nov 2022 16:04:27 GMT Subject: Integrated: JDK-8296235: IGV: Change shortcut to delete graph from ctrl+del to del In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 15:29:55 GMT, Tobias Holenstein wrote: > Change the current shortcut to delete a graph from `ctrl`+`delete` to `delete` only (dropping ctrl). > > For keyboards without the `delete` key (like Mac) the alternative shortcut `ctrl`+`backspace` is unchanged This pull request has now been integrated. Changeset: b807470a Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/b807470af495fcf12aca85411a890e95814584ae Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8296235: IGV: Change shortcut to delete graph from ctrl+del to del Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10950 From kvn at openjdk.org Wed Nov 2 16:51:35 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 2 Nov 2022 16:51:35 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 13:13:46 GMT, Martin Doerr wrote: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Thank you for testing explanation. An other question: why only MH? May be this should be done for all native methods. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From kvn at openjdk.org Wed Nov 2 17:02:41 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 2 Nov 2022 17:02:41 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 12:44:42 GMT, Claes Redestad wrote: > We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported. This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. > > I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. > > Testing: tier1-tier3, manual verification Looks good. src/hotspot/cpu/x86/vm_version_x86.cpp line 1041: > 1039: } > 1040: } else { > 1041: if (UseSSE > 3) { We need an other RFE to clean this up. We should disable CPU features according UseSSE value as we already do for UseAVX. For these changes this is fine. ------------- Marked as reviewed by kvn (Reviewer). 
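To see the proposed UseSSE/UseAVX coupling from a test's point of view, a hypothetical jtreg-style sketch (the test name and expectation are illustrative, only meaningful on AVX-capable x86 hardware, and the exact -XX:+PrintFlagsFinal output format may differ):

```java
// Hypothetical sketch; assumes @library /test/lib for ProcessTools/OutputAnalyzer.
import jdk.test.lib.process.OutputAnalyzer;
import jdk.test.lib.process.ProcessTools;

public class UseSSEDisablesAVXSketch {
    public static void main(String[] args) throws Exception {
        // With the proposed change, an explicit UseSSE below 4 should implicitly disable AVX.
        ProcessBuilder pb = ProcessTools.createJavaProcessBuilder(
                "-XX:UseSSE=2", "-XX:+PrintFlagsFinal", "-version");
        OutputAnalyzer out = new OutputAnalyzer(pb.start());
        out.shouldHaveExitValue(0);
        out.shouldMatch("UseAVX\\s+=\\s+0");   // expect AVX to be ergonomically switched off
    }
}
```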
PR: https://git.openjdk.org/jdk/pull/10946 From mdoerr at openjdk.org Wed Nov 2 17:03:33 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Nov 2022 17:03:33 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 16:49:06 GMT, Vladimir Kozlov wrote: > Thank you for testing explanation. > > An other question: why only MH? May be this should be done for all native methods. Only MH intrinsic allocation failures force the VM to terminate. Native wrappers are not required to continue execution. They can still get generated after GC has done its work and the other spaces are available again. I only wanted to use the NonNMethod space in emergency situations when the VM has no alternative. If we use it more often, it may get full, too, and then we wouldn't have any free space at all. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From vlivanov at openjdk.org Wed Nov 2 17:16:24 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 2 Nov 2022 17:16:24 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE In-Reply-To: References: Message-ID: <8V-k860AUGlUT-6vBLYV14HgtFKybVmmZ-cozwdBcT4=.a600d585-1230-42d7-8d81-8d0afc91b976@github.com> On Wed, 2 Nov 2022 17:00:30 GMT, Vladimir Kozlov wrote: >> We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported. This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. >> >> I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. >> >> Testing: tier1-tier3, manual verification > > src/hotspot/cpu/x86/vm_version_x86.cpp line 1041: > >> 1039: } >> 1040: } else { >> 1041: if (UseSSE > 3) { > > We need an other RFE to clean this up. We should disable CPU features according UseSSE value as we already do for UseAVX. > For these changes this is fine. It's already the case for `UseSSE`: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/vm_version_x86.cpp#L858 ------------- PR: https://git.openjdk.org/jdk/pull/10946 From dlong at openjdk.org Wed Nov 2 17:20:22 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 2 Nov 2022 17:20:22 GMT Subject: RFR: 8296101: nmethod::is_unloading result unstable with concurrent unloading [v2] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 10:02:37 GMT, Erik ?sterlund wrote: >> If an nmethod is not called during a concurrent full GC, then after marking has terminated, multiple threads can call is_unloading. If at the same time, the nmethod is made not_entrant, then we run into a source of instability in the is_cold() calculation used when computing is_unloading. There we check if the nmethod is_not_entrant(), which some concurrent observers will think is true, while others think it's false. >> The current code that sets the is_unloading_state in is_unloading() assumes that the computed state is the same across all observers. However, that is no longer true. >> >> I propose to set the is_unloading_state with a CAS instead of plain store. 
Then, as is_unloading() is computed before making nmethods not_entrant, we can guarantee that all concurrent readers of is_unloading in this scenario will return false in the current unloading cycle, instead of racingly returning either false or true. One thread wins, and it will say false, and the other threads will compute conflicting results, but end up agreeing after the CAS, that they should all return false. >> >> Tested with mach5 tier1-7. Also tried replacing the is_not_entrant() ingredient in is_cold with os::random() to simulate the instability source. Without my fix RunThese crashes almost immediately, and with my fix it doesn't crash. > > Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: > > Avoid recursion Marked as reviewed by dlong (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10926 From vlivanov at openjdk.org Wed Nov 2 17:29:21 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 2 Nov 2022 17:29:21 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 12:44:42 GMT, Claes Redestad wrote: > We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported. This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. > > I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. > > Testing: tier1-tier3, manual verification src/hotspot/cpu/x86/vm_version_x86.cpp line 905: > 903: warning("UseSSE=%d is not supported on this CPU, setting it to UseSSE=%d", (int) UseSSE, use_sse_limit); > 904: FLAG_SET_DEFAULT(UseSSE, use_sse_limit); > 905: } else if (UseSSE < 0) { `UseSSE < 0` check is redundant since the flag is already constrained to `[0..99]` range. src/hotspot/cpu/x86/vm_version_x86.cpp line 941: > 939: } > 940: FLAG_SET_DEFAULT(UseAVX, use_avx_limit); > 941: } else if (UseAVX < 0) { `UseAVX < 0` check can also be cleaned up. ------------- PR: https://git.openjdk.org/jdk/pull/10946 From vlivanov at openjdk.org Wed Nov 2 17:29:23 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 2 Nov 2022 17:29:23 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE In-Reply-To: <8V-k860AUGlUT-6vBLYV14HgtFKybVmmZ-cozwdBcT4=.a600d585-1230-42d7-8d81-8d0afc91b976@github.com> References: <8V-k860AUGlUT-6vBLYV14HgtFKybVmmZ-cozwdBcT4=.a600d585-1230-42d7-8d81-8d0afc91b976@github.com> Message-ID: On Wed, 2 Nov 2022 17:12:35 GMT, Vladimir Ivanov wrote: >> src/hotspot/cpu/x86/vm_version_x86.cpp line 1041: >> >>> 1039: } >>> 1040: } else { >>> 1041: if (UseSSE > 3) { >> >> We need an other RFE to clean this up. We should disable CPU features according UseSSE value as we already do for UseAVX. >> For these changes this is fine. > > It's already the case for `UseSSE`: > https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/vm_version_x86.cpp#L858 I find the original check clearer: it says that `UseAESCTRIntrinsics`-related code depends specifically on sse4.1. Please, leave it as is. 
------------- PR: https://git.openjdk.org/jdk/pull/10946 From duke at openjdk.org Wed Nov 2 17:37:18 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Wed, 2 Nov 2022 17:37:18 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics Message-ID: Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. The current value is too low and the intrinsic is not executed. ------------- Commit messages: - 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics Changes: https://git.openjdk.org/jdk/pull/10954/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10954&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296190 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10954.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10954/head:pull/10954 PR: https://git.openjdk.org/jdk/pull/10954 From eastigeevich at openjdk.org Wed Nov 2 18:00:26 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 2 Nov 2022 18:00:26 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 17:30:19 GMT, Yi-Fan Tsai wrote: > Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. > > Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. Lgtm ------------- Marked as reviewed by eastigeevich (Committer). PR: https://git.openjdk.org/jdk/pull/10954 From phh at openjdk.org Wed Nov 2 18:19:24 2022 From: phh at openjdk.org (Paul Hohensee) Date: Wed, 2 Nov 2022 18:19:24 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 17:30:19 GMT, Yi-Fan Tsai wrote: > Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. > > Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. Lgtm, and trivial. ------------- Marked as reviewed by phh (Reviewer). PR: https://git.openjdk.org/jdk/pull/10954 From simonis at openjdk.org Wed Nov 2 18:24:48 2022 From: simonis at openjdk.org (Volker Simonis) Date: Wed, 2 Nov 2022 18:24:48 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 17:30:19 GMT, Yi-Fan Tsai wrote: > Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. > > Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. I'm a little surprised that this isn't detected. The tests you mention have a verification step (i.e. `compiler.testlibrary.intrinsics.Verifier`) which should detect this. Can you please check why that is not the case and if this also affects other intrinsic tests? 
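For reference, the direction the thread later takes — deriving the warmup count from the VM's own threshold via the WhiteBox API — could look roughly like this. This is a hedged sketch only: the class and local names are invented, and it assumes the WhiteBox jar on the boot class path plus -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI.

```java
// Illustrative sketch only, not the actual DigestSanityTestBase change.
import java.security.MessageDigest;
import sun.hotspot.WhiteBox;

public class DigestWarmupSketch {
    private static final WhiteBox WB = WhiteBox.getWhiteBox();

    public static void main(String[] args) throws Exception {
        // Read the tier-4 threshold actually in effect (default or -XX: override)
        // and warm up somewhat beyond it so the intrinsic-backed compilation is reached.
        long threshold = WB.getIntxVMFlag("Tier4InvocationThreshold");
        int warmupIterations = (int) threshold + 50;

        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] data = new byte[1024];
        for (int i = 0; i < warmupIterations; i++) {
            md.update(data);
            md.digest();
        }
    }
}
```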
------------- PR: https://git.openjdk.org/jdk/pull/10954 From dlong at openjdk.org Wed Nov 2 18:27:58 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 2 Nov 2022 18:27:58 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 13:13:46 GMT, Martin Doerr wrote: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. I'm not sure this change is safe. We never put nmethods into NonNMethod (see code_blob_type_accepts_nmethod() and CodeCache::is_non_nmethod()). If we did, a lot of things could break, because NMethodIterator doesn't look in that heap for nmethods. We might get away with it if method handle intrinsics don't have oops or inline caches, but it seems dangerous. Also, this change only makes the failure harder to reproduce. We could still run out of NonNMethod space, right? ------------- PR: https://git.openjdk.org/jdk/pull/10933 From simonis at openjdk.org Wed Nov 2 18:41:52 2022 From: simonis at openjdk.org (Volker Simonis) Date: Wed, 2 Nov 2022 18:41:52 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics In-Reply-To: References: Message-ID: <3TYLdbG_EPXJRRzgCKTBzqtJtCJ6cG1RWM7H7u7uzds=.7a722f5b-458b-4c13-ac06-d5881fd9c803@github.com> On Wed, 2 Nov 2022 17:30:19 GMT, Yi-Fan Tsai wrote: > Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. > > Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. I'm a little surprised that this isn't detected. The tests you mention have a verification step (i.e. compiler.testlibrary.intrinsics.Verifier) which should detect this. Can you please check why that is not the case and if this also affects other intrinsic tests? Also, it is a little unfortunate, that the new warmup value still implicitly depends on the -XX:CompileThreshold=500 setting in the actual test. It would be more stable to get that value explicitly (e.g. by using the WhiteBox API) and use a strictly bigger warmup value. Finally, I think the actual tests should be run with -Xbatch to be sure that the corresponding methods have really been compiled. ------------- Changes requested by simonis (Reviewer). PR: https://git.openjdk.org/jdk/pull/10954 From dlong at openjdk.org Wed Nov 2 18:42:26 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 2 Nov 2022 18:42:26 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: <3UBXIZuZaENBdGz0LCPQEnbh0LiUJeYZDoQWe6qAVPY=.51cfcd7a-fdc0-439e-a68f-9632bd8cc99c@github.com> On Tue, 1 Nov 2022 13:13:46 GMT, Martin Doerr wrote: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. I think we need a test for this case, even if it requires the use of WhiteBox to trigger it reliably. 
For an alternative solution, can't we fall back to using the pre-generated interpreter entries? ------------- PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Wed Nov 2 19:33:20 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Nov 2022 19:33:20 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 18:22:25 GMT, Dean Long wrote: > I'm not sure this change is safe. We never put nmethods into NonNMethod (see code_blob_type_accepts_nmethod() and CodeCache::is_non_nmethod()). If we did, a lot of things could break, because NMethodIterator doesn't look in that heap for nmethods. We might get away with it if method handle intrinsics don't have oops or inline caches, but it seems dangerous. Also, this change only makes the failure harder to reproduce. We could still run out of NonNMethod space, right? I think the method handle intrinsics were just implemented as nmethods because it was easy to do. They just perform method handle dispatch and contain no immediate oops. They are more like stubs (and like the interpreter versions which aren't nmethods, either). So, I can't see what could break by treating them this way. To your 2nd question: No, we can't just fall back to using interpreter entries when calling from compiled code. The calling convention is not compatible. We could get out of space in the NonNMethod segment, too, but the GC has more time until then. I have thought about waiting for the GC, but how long should we wait? A complete marking cycle of a huge Java heap may take time. Terminating the VM may still be undesired, but an unresponsive application, too. The situation sounds similar to "GC overhead limit exceeded". ------------- PR: https://git.openjdk.org/jdk/pull/10933 From dlong at openjdk.org Wed Nov 2 20:22:36 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 2 Nov 2022 20:22:36 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 13:13:46 GMT, Martin Doerr wrote: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. I believe the calling convention issues would be handled by the c2i adapter. We already force calling from the interpreter to use the i2c adapter. It looks like the problem with falling back to interpreter intrinsics is _linkToNative, which is missing. All platforms only generate it for compiled. From code comments, it seems like this was an implementation short-cut and falls under technical debt. Here is my suggestion: 1. File an RFE to support _linkToNative interpreter intrinsic, fallback to interpreted through c2i adapter 2. Proceed with this PR as a stop-gap For this PR, it would be reassuring to assert that any nmethods added to NonNMethod have no oops and no inlined caches, which probably also implies no metadata and no relocations. Also, I would really like a test that reproduces the problem. Would using -XX:-MethodFlushing -XX:-UseCodeCacheFlushing help with that? 
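On the question of a reproducer, one purely hypothetical direction (untested; class name and loop bounds are invented) is a stress program that keeps linking method handles of new shapes while the code cache is kept artificially small, e.g. with a low -XX:ReservedCodeCacheSize and -XX:-UseCodeCacheFlushing:

```java
// Hypothetical stress sketch, not a test from this PR.
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Collections;

public class MHIntrinsicStressSketch {
    static volatile int sink;

    static void accept(Object[] args) {
        sink = args.length;
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle base = MethodHandles.lookup().findStatic(
                MHIntrinsicStressSketch.class, "accept",
                MethodType.methodType(void.class, Object[].class));

        // Each arity yields a distinct MethodType, so the runtime keeps producing new
        // invoker/linker stubs; whether this actually drives the "Out of space in
        // CodeCache for method handle intrinsic" condition depends on the flags used
        // and is an untested assumption.
        for (int arity = 0; arity < 100; arity++) {
            MethodHandle mh = base.asCollector(Object[].class, arity);
            mh.invokeWithArguments(Collections.nCopies(arity, (Object) null));
        }
    }
}
```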
------------- PR: https://git.openjdk.org/jdk/pull/10933 From kbarrett at openjdk.org Wed Nov 2 20:32:01 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Nov 2022 20:32:01 GMT Subject: RFR: 8296163: [aarch64] Cleanup Pre/Post addressing mode classes [v2] In-Reply-To: References: Message-ID: > Please review this cleanup of the aarch64 Pre, Post, and PrePost addressing > mode helper classes. > > The special functions for the PrePost class are changed from public to > protected, to ensure no slicing is possible. > > In the Post class constructors, initialization of the members declared > directly in that class is now performed in the ctor-initializer rather than by > assignments in the body. > > The member reader functions in PrePost and Post are now const. Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into pre/post-cleanup - cleanup Pre/Post/PrePost ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10929/files - new: https://git.openjdk.org/jdk/pull/10929/files/867ad2aa..1306b2ca Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10929&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10929&range=00-01 Stats: 42347 lines in 285 files changed: 10139 ins; 30874 del; 1334 mod Patch: https://git.openjdk.org/jdk/pull/10929.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10929/head:pull/10929 PR: https://git.openjdk.org/jdk/pull/10929 From kbarrett at openjdk.org Wed Nov 2 20:32:04 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Nov 2022 20:32:04 GMT Subject: RFR: 8296163: [aarch64] Cleanup Pre/Post addressing mode classes [v2] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 11:00:59 GMT, Christian Hagedorn wrote: >> Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into pre/post-cleanup >> - cleanup Pre/Post/PrePost > > Looks good! Thanks @chhagedorn and @theRealAph for reviews. ------------- PR: https://git.openjdk.org/jdk/pull/10929 From kbarrett at openjdk.org Wed Nov 2 20:33:45 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Nov 2022 20:33:45 GMT Subject: Integrated: 8296163: [aarch64] Cleanup Pre/Post addressing mode classes In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 10:21:12 GMT, Kim Barrett wrote: > Please review this cleanup of the aarch64 Pre, Post, and PrePost addressing > mode helper classes. > > The special functions for the PrePost class are changed from public to > protected, to ensure no slicing is possible. > > In the Post class constructors, initialization of the members declared > directly in that class is now performed in the ctor-initializer rather than by > assignments in the body. > > The member reader functions in PrePost and Post are now const. This pull request has now been integrated. 
Changeset: c7b95a89 Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/c7b95a895fe66a00c754b590ebde53087f183c51 Stats: 12 lines in 1 file changed: 5 ins; 0 del; 7 mod 8296163: [aarch64] Cleanup Pre/Post addressing mode classes Reviewed-by: chagedorn, aph ------------- PR: https://git.openjdk.org/jdk/pull/10929 From redestad at openjdk.org Wed Nov 2 21:39:25 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 2 Nov 2022 21:39:25 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE In-Reply-To: References: Message-ID: <4Xo_4-o3YQ59IkloSA_rKs9fqPFEUSPKgSKVk6qeeEs=.e51747e0-ab96-41c6-8dc8-6018702ad9f1@github.com> On Wed, 2 Nov 2022 17:20:03 GMT, Vladimir Ivanov wrote: >> We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported. This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. >> >> I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. >> >> Testing: tier1-tier3, manual verification > > src/hotspot/cpu/x86/vm_version_x86.cpp line 941: > >> 939: } >> 940: FLAG_SET_DEFAULT(UseAVX, use_avx_limit); >> 941: } else if (UseAVX < 0) { > > `UseAVX < 0` check can also be cleaned up. Right, the flag range predicate check precedes this code, thus UseAVX < 0 is impossible and this code can be removed. ------------- PR: https://git.openjdk.org/jdk/pull/10946 From redestad at openjdk.org Wed Nov 2 21:49:52 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 2 Nov 2022 21:49:52 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE [v2] In-Reply-To: References: Message-ID: > We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported. This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. > > I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. 
> > Testing: tier1-tier3, manual verification Claes Redestad has updated the pull request incrementally with two additional commits since the last revision: - Remove redundant SSE < 0 check - Remove redundant AVX < 0 check, revert narrow supports_sse4_1 check ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10946/files - new: https://git.openjdk.org/jdk/pull/10946/files/b3d347c8..4c2e9ca5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10946&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10946&range=00-01 Stats: 7 lines in 1 file changed: 0 ins; 6 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10946.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10946/head:pull/10946 PR: https://git.openjdk.org/jdk/pull/10946 From redestad at openjdk.org Wed Nov 2 21:49:53 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 2 Nov 2022 21:49:53 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE [v2] In-Reply-To: References: Message-ID: <7roMAEYB8DwvV-H2TEe2Zmg34bvy3udOaA7MiHCmTRc=.29d4076c-087a-4be3-8dd3-135add00864a@github.com> On Wed, 2 Nov 2022 17:19:27 GMT, Vladimir Ivanov wrote: >> Claes Redestad has updated the pull request incrementally with two additional commits since the last revision: >> >> - Remove redundant SSE < 0 check >> - Remove redundant AVX < 0 check, revert narrow supports_sse4_1 check > > src/hotspot/cpu/x86/vm_version_x86.cpp line 905: > >> 903: warning("UseSSE=%d is not supported on this CPU, setting it to UseSSE=%d", (int) UseSSE, use_sse_limit); >> 904: FLAG_SET_DEFAULT(UseSSE, use_sse_limit); >> 905: } else if (UseSSE < 0) { > > `UseSSE < 0` check is redundant since the flag is already constrained to `[0..99]` range. Fixed. (Should such ranges be narrowed down to the actual values supported?) ------------- PR: https://git.openjdk.org/jdk/pull/10946 From redestad at openjdk.org Wed Nov 2 21:49:53 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 2 Nov 2022 21:49:53 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE [v2] In-Reply-To: References: <8V-k860AUGlUT-6vBLYV14HgtFKybVmmZ-cozwdBcT4=.a600d585-1230-42d7-8d81-8d0afc91b976@github.com> Message-ID: <3kkiUBTJ-nzkaeW5XxM4k8-8WAIeOK51oamV6TbYGGA=.805825d9-d59b-4cca-b948-5c5482ce84da@github.com> On Wed, 2 Nov 2022 17:25:55 GMT, Vladimir Ivanov wrote: >> It's already the case for `UseSSE`: >> https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/vm_version_x86.cpp#L858 > > I find the original check clearer: it says that `UseAESCTRIntrinsics`-related code depends specifically on sse4.1. Please, leave it as is. Right, I first figured it was ignoring user settings, but as you point out the feature bits will be zeroed out anyhow. ------------- PR: https://git.openjdk.org/jdk/pull/10946 From kvn at openjdk.org Wed Nov 2 22:03:24 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 2 Nov 2022 22:03:24 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE [v2] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 21:49:52 GMT, Claes Redestad wrote: >> We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported. This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. 
Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. >> >> I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. >> >> Testing: tier1-tier3, manual verification > > Claes Redestad has updated the pull request incrementally with two additional commits since the last revision: > > - Remove redundant SSE < 0 check > - Remove redundant AVX < 0 check, revert narrow supports_sse4_1 check This looks good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10946 From vlivanov at openjdk.org Wed Nov 2 22:03:25 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 2 Nov 2022 22:03:25 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE [v2] In-Reply-To: <7roMAEYB8DwvV-H2TEe2Zmg34bvy3udOaA7MiHCmTRc=.29d4076c-087a-4be3-8dd3-135add00864a@github.com> References: <7roMAEYB8DwvV-H2TEe2Zmg34bvy3udOaA7MiHCmTRc=.29d4076c-087a-4be3-8dd3-135add00864a@github.com> Message-ID: On Wed, 2 Nov 2022 21:45:46 GMT, Claes Redestad wrote: > Should such ranges be narrowed down to the actual values supported? Yes. At some point, `UseAVX` was set to `99` by default and then downgraded to the actual version. (BTW the same still happens for `UseSSE`.) I think the highest supported version should be used as the default instead. Feel free to file a followup RFE or fix it right away. ------------- PR: https://git.openjdk.org/jdk/pull/10946 From vlivanov at openjdk.org Wed Nov 2 22:09:29 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 2 Nov 2022 22:09:29 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE [v2] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 21:49:52 GMT, Claes Redestad wrote: >> We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported. This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. >> >> I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. >> >> Testing: tier1-tier3, manual verification > > Claes Redestad has updated the pull request incrementally with two additional commits since the last revision: > > - Remove redundant SSE < 0 check > - Remove redundant AVX < 0 check, revert narrow supports_sse4_1 check Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/10946 From vlivanov at openjdk.org Wed Nov 2 22:09:30 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 2 Nov 2022 22:09:30 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE [v2] In-Reply-To: References: <7roMAEYB8DwvV-H2TEe2Zmg34bvy3udOaA7MiHCmTRc=.29d4076c-087a-4be3-8dd3-135add00864a@github.com> Message-ID: On Wed, 2 Nov 2022 21:57:38 GMT, Vladimir Ivanov wrote: >> Fixed. (Should such ranges be narrowed down to the actual values supported?) > >> Should such ranges be narrowed down to the actual values supported? > > Yes. > > At some point, `UseAVX` was set to `99` by default and then downgraded to the actual version. 
(BTW the same still happens for `UseSSE`.) I think the highest supported version should be used as the default instead. > > Feel free to file a followup RFE or fix it right away. Also, as another cleanup idea, it makes sense to convert `UseSEE`/`UseAVX` from `intx` to `int` and get rid of `(int)` casts. ------------- PR: https://git.openjdk.org/jdk/pull/10946 From redestad at openjdk.org Wed Nov 2 22:09:30 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 2 Nov 2022 22:09:30 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE [v2] In-Reply-To: References: <7roMAEYB8DwvV-H2TEe2Zmg34bvy3udOaA7MiHCmTRc=.29d4076c-087a-4be3-8dd3-135add00864a@github.com> Message-ID: On Wed, 2 Nov 2022 22:03:48 GMT, Vladimir Ivanov wrote: >>> Should such ranges be narrowed down to the actual values supported? >> >> Yes. >> >> At some point, `UseAVX` was set to `99` by default and then downgraded to the actual version. (BTW the same still happens for `UseSSE`.) I think the highest supported version should be used as the default instead. >> >> Feel free to file a followup RFE or fix it right away. > > Also, as another cleanup idea, it makes sense to convert `UseSEE`/`UseAVX` from `intx` to `int` and get rid of `(int)` casts. Might need a CSR since we'd upgrade from a warning to an error. I'll defer that to a follow-up. ------------- PR: https://git.openjdk.org/jdk/pull/10946 From rrich at openjdk.org Wed Nov 2 22:20:28 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 2 Nov 2022 22:20:28 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 13:13:46 GMT, Martin Doerr wrote: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. I think the continuation intrinsics (see `Method::is_continuation_native_intrinsic()`) are problematic too. There are no interpreted versions of them. It will be difficult to implement interpreted versions. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Wed Nov 2 22:28:40 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Nov 2022 22:28:40 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v2] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Add assertions to ensure no immediate oops in MH intrinsics. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/06f418c6..885c51c0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=00-01 Stats: 15 lines in 1 file changed: 15 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From rrich at openjdk.org Wed Nov 2 23:19:26 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 2 Nov 2022 23:19:26 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 Message-ID: Hi, this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. X86 / AARCH64 PPC64: : : : : : : : : | | | | |-----------------| |-----------------| | | | | | stack arguments | | stack arguments | | |<- callers_SP | | =================== |-----------------| | | | | | metadata at bottom | | metadata at top | | | | |<- callers_SP |-----------------| =================== | | | | | | | | | | | | | | | | | |<- SP | | =================== |-----------------| | | | metadata at top | | |<- SP =================== On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` * address of stack arguments: `callers_SP + frame::metadata_words_at_top` * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. 
Note that -XX:+VerifyContinuations is explicitly set as I found it very useful, it increases the runtime quite a bit though. Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. Thanks, Richard. ------------- Commit messages: - Loom ppc64le port Changes: https://git.openjdk.org/jdk/pull/10961/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8286302 Stats: 3353 lines in 66 files changed: 2951 ins; 103 del; 299 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From duke at openjdk.org Thu Nov 3 00:19:03 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Thu, 3 Nov 2022 00:19:03 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics [v2] In-Reply-To: References: Message-ID: > Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. > > Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision: Set WARMUP_ITERATIONS based on Tier4InvocationThreshold ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10954/files - new: https://git.openjdk.org/jdk/pull/10954/files/b1ff22d2..b11fe5a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10954&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10954&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10954.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10954/head:pull/10954 PR: https://git.openjdk.org/jdk/pull/10954 From phh at openjdk.org Thu Nov 3 00:19:05 2022 From: phh at openjdk.org (Paul Hohensee) Date: Thu, 3 Nov 2022 00:19:05 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 17:30:19 GMT, Yi-Fan Tsai wrote: > Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. > > Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. Hi, Volker. Perhaps we could let this patch go in to fix the immediate problem, and have Yifan file an issue for a definitive fix? ------------- PR: https://git.openjdk.org/jdk/pull/10954 From duke at openjdk.org Thu Nov 3 00:33:28 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Thu, 3 Nov 2022 00:33:28 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics [v2] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 00:19:03 GMT, Yi-Fan Tsai wrote: >> Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. >> >> Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed.
> > Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision: > > Set WARMUP_ITERATIONS based on Tier4InvocationThreshold The verifier detects the intrinsic compiled/inlined somewhere, but it doesn't get executed. In other words, the verifier alone didn't guarantee the intrinsic tested. WARMUP_ITERATIONS is now set strictly larger than Tier4InvocationThreshold. Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with `-Xbatch`. ------------- PR: https://git.openjdk.org/jdk/pull/10954 From dlong at openjdk.org Thu Nov 3 01:26:11 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Nov 2022 01:26:11 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 22:18:25 GMT, Richard Reingruber wrote: > I think the continuation intrinsics (see `Method::is_continuation_native_intrinsic()`) are problematic too. There are no interpreted versions of them. It will be difficult to implement interpreted versions. Good point. It's not really interpreted vs compiled that is important, it's eager-allocated vs lazy-allocated, and if we need a different one for every polymorphic signature. Somewhere in between lazy and eager is "delayed". We could chose a low-water mark, where the code cache has free space less than "LazyIntrinsicsSize" available, where we force these intrinsics to get generated if they weren't already generated lazily. The problem is the missing signature-independent interpreted _linkToNative and the fact that the compiled version depends on the polymorphic signature, forcing them to be generated lazily to handle all possible signatures. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From jiefu at openjdk.org Thu Nov 3 03:16:30 2022 From: jiefu at openjdk.org (Jie Fu) Date: Thu, 3 Nov 2022 03:16:30 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v3] In-Reply-To: References: Message-ID: > Hi all, > > As discussed here https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it would be better to add the vector api tests in GHA. > > Thanks. > Best regards, > Jie Jie Fu has updated the pull request incrementally with two additional commits since the last revision: - Update test/jdk/TEST.groups Co-authored-by: Aleksey Shipil?v - Update test/jdk/TEST.groups Co-authored-by: Aleksey Shipil?v ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10879/files - new: https://git.openjdk.org/jdk/pull/10879/files/f901b907..b0fa749f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10879&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10879&range=01-02 Stats: 4 lines in 1 file changed: 2 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10879.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10879/head:pull/10879 PR: https://git.openjdk.org/jdk/pull/10879 From eastigeevich at openjdk.org Thu Nov 3 03:30:26 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 3 Nov 2022 03:30:26 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics [v2] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 00:19:03 GMT, Yi-Fan Tsai wrote: >> Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. >> >> Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. 
> > Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision: > > Set WARMUP_ITERATIONS based on Tier4InvocationThreshold test/hotspot/jtreg/compiler/intrinsics/sha/sanity/DigestSanityTestBase.java line 57: > 55: private static final int OFFSET = 0; > 56: private static final int ITERATIONS = 10000; > 57: private static final int WARMUP_ITERATIONS = WHITE_BOX.getIntxVMFlag("Tier4InvocationThreshold").intValue() + 50; If the flag `Tier4InvocationThreshold` is not set, what will happen? ------------- PR: https://git.openjdk.org/jdk/pull/10954 From jiefu at openjdk.org Thu Nov 3 03:31:56 2022 From: jiefu at openjdk.org (Jie Fu) Date: Thu, 3 Nov 2022 03:31:56 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v4] In-Reply-To: References: Message-ID: <94_J9ooQ7Sdo3Y9_Z2xOWZbzOfZ8szVM6HwqJJxnRI0=.ff557a5c-3a88-488e-8f02-73287fd3363d@github.com> > Hi all, > > As discussed here https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it would be better to add the vector api tests in GHA. > > Thanks. > Best regards, > Jie Jie Fu has updated the pull request incrementally with one additional commit since the last revision: Sort jdk_vector_sanity tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10879/files - new: https://git.openjdk.org/jdk/pull/10879/files/b0fa749f..d353e26f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10879&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10879&range=02-03 Stats: 19 lines in 1 file changed: 9 ins; 9 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10879.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10879/head:pull/10879 PR: https://git.openjdk.org/jdk/pull/10879 From jiefu at openjdk.org Thu Nov 3 03:33:22 2022 From: jiefu at openjdk.org (Jie Fu) Date: Thu, 3 Nov 2022 03:33:22 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v2] In-Reply-To: References: <8uMsheGwjIBbgf1nJeYkCR4Pt_ddbnq4WVKMRPwS7C0=.c59b10eb-67e1-439c-936e-857c8430c520@github.com> Message-ID: On Wed, 2 Nov 2022 15:11:36 GMT, Aleksey Shipilev wrote: >> Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Add jdk_vector_sanity test group >> - Merge branch 'master' into JDK-8295970 >> - Revert changes in test.yml >> - 8295970: Add jdk_vector tests in GHA > > I think the usual style for these files is to maintain original order/sorting. @shipilev , all your comments had been resolved. Thanks. 
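(For context on the test-group changes being reviewed here: a jtreg group such as `jdk_vector_sanity` is simply a named list of test files in test/jdk/TEST.groups. The entries below are purely illustrative and are not the actual hunk from this PR:)

    jdk_vector_sanity = \
        jdk/incubator/vector/Byte64VectorTests.java \
        jdk/incubator/vector/Float512VectorTests.java \
        jdk/incubator/vector/Int256VectorTests.java \
        jdk/incubator/vector/Long256VectorTests.java

Keeping such lists sorted, as suggested in the review above, keeps later diffs small and merge conflicts easy to resolve.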
------------- PR: https://git.openjdk.org/jdk/pull/10879 From duke at openjdk.org Thu Nov 3 05:17:23 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Thu, 3 Nov 2022 05:17:23 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics [v2] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 03:26:41 GMT, Evgeny Astigeevich wrote: >> Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision: >> >> Set WARMUP_ITERATIONS based on Tier4InvocationThreshold > > test/hotspot/jtreg/compiler/intrinsics/sha/sanity/DigestSanityTestBase.java line 57: > >> 55: private static final int OFFSET = 0; >> 56: private static final int ITERATIONS = 10000; >> 57: private static final int WARMUP_ITERATIONS = WHITE_BOX.getIntxVMFlag("Tier4InvocationThreshold").intValue() + 50; > > If the flag `Tier4InvocationThreshold` is not set, what will happen? The default value, 5000, would be returned. ------------- PR: https://git.openjdk.org/jdk/pull/10954 From fyang at openjdk.org Thu Nov 3 07:47:05 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 3 Nov 2022 07:47:05 GMT Subject: RFR: 8296285: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java fails after JDK-8280378 Message-ID: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> Hi, Please review this trivial change fixing a typo in jtreg test: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java This test case fails to compile by javac when running on RISC-V. Testing: Test case passed on RISC-V after this typo is fixed. ------------- Commit messages: - 8296285: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java fails after JDK-8280378 Changes: https://git.openjdk.org/jdk/pull/10965/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10965&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296285 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10965.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10965/head:pull/10965 PR: https://git.openjdk.org/jdk/pull/10965 From jiefu at openjdk.org Thu Nov 3 08:19:26 2022 From: jiefu at openjdk.org (Jie Fu) Date: Thu, 3 Nov 2022 08:19:26 GMT Subject: RFR: 8296285: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java fails after JDK-8280378 In-Reply-To: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> References: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> Message-ID: On Thu, 3 Nov 2022 07:39:05 GMT, Fei Yang wrote: > Hi, > Please review this trivial change fixing a typo in jtreg test: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java > This typo is introduced by JDK-8280378 and the test case fails to compile by javac when running on RISC-V. > Testing: Test case passed on RISC-V after this typo is fixed. LGTM and trivial. ------------- Marked as reviewed by jiefu (Reviewer). PR: https://git.openjdk.org/jdk/pull/10965 From redestad at openjdk.org Thu Nov 3 11:16:40 2022 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 3 Nov 2022 11:16:40 GMT Subject: RFR: 8296168: x86: Add reasonable constraints between AVX and SSE [v2] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 21:49:52 GMT, Claes Redestad wrote: >> We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported.
This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. >> >> I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. >> >> Testing: tier1-tier3, manual verification > > Claes Redestad has updated the pull request incrementally with two additional commits since the last revision: > > - Remove redundant SSE < 0 check > - Remove redundant AVX < 0 check, revert narrow supports_sse4_1 check Thanks for reviewing! ------------- PR: https://git.openjdk.org/jdk/pull/10946 From simonis at openjdk.org Thu Nov 3 11:16:50 2022 From: simonis at openjdk.org (Volker Simonis) Date: Thu, 3 Nov 2022 11:16:50 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics [v2] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 00:19:03 GMT, Yi-Fan Tsai wrote: >> Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. >> >> Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. > > Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision: > > Set WARMUP_ITERATIONS based on Tier4InvocationThreshold Hi Yi-Fan, Thanks for doing the additional changes and for confirming that the test run with `-Xbatch`. Looks good now. Best regards, Volker ------------- Marked as reviewed by simonis (Reviewer). PR: https://git.openjdk.org/jdk/pull/10954 From shade at openjdk.org Thu Nov 3 11:18:35 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 3 Nov 2022 11:18:35 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v4] In-Reply-To: <94_J9ooQ7Sdo3Y9_Z2xOWZbzOfZ8szVM6HwqJJxnRI0=.ff557a5c-3a88-488e-8f02-73287fd3363d@github.com> References: <94_J9ooQ7Sdo3Y9_Z2xOWZbzOfZ8szVM6HwqJJxnRI0=.ff557a5c-3a88-488e-8f02-73287fd3363d@github.com> Message-ID: On Thu, 3 Nov 2022 03:31:56 GMT, Jie Fu wrote: >> Hi all, >> >> As discussed here https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it would be better to add the vector api tests in GHA. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Sort jdk_vector_sanity tests Marked as reviewed by shade (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10879 From jiefu at openjdk.org Thu Nov 3 11:18:35 2022 From: jiefu at openjdk.org (Jie Fu) Date: Thu, 3 Nov 2022 11:18:35 GMT Subject: RFR: 8295970: Add vector api sanity tests in tier1 [v4] In-Reply-To: <94_J9ooQ7Sdo3Y9_Z2xOWZbzOfZ8szVM6HwqJJxnRI0=.ff557a5c-3a88-488e-8f02-73287fd3363d@github.com> References: <94_J9ooQ7Sdo3Y9_Z2xOWZbzOfZ8szVM6HwqJJxnRI0=.ff557a5c-3a88-488e-8f02-73287fd3363d@github.com> Message-ID: On Thu, 3 Nov 2022 03:31:56 GMT, Jie Fu wrote: >> Hi all, >> >> As discussed here https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it would be better to add the vector api tests in GHA. >> >> Thanks. 
>> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Sort jdk_vector_sanity tests Thank you all for the review and comments. ------------- PR: https://git.openjdk.org/jdk/pull/10879 From dongbo at openjdk.org Thu Nov 3 11:50:50 2022 From: dongbo at openjdk.org (Dong Bo) Date: Thu, 3 Nov 2022 11:50:50 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: <1K3GDuQ_B_jYVDTYPvYcbhbUeJM1RihgIiHnbvlFYyQ=.fa471a1d-49bb-4ef3-87d0-b70ee82334ac@github.com> References: <1K3GDuQ_B_jYVDTYPvYcbhbUeJM1RihgIiHnbvlFYyQ=.fa471a1d-49bb-4ef3-87d0-b70ee82334ac@github.com> Message-ID: <90d1Dd8-KcmzqfcORr_lN2eW-m7tMn842Wn-O05lIUQ=.19e83361-61c3-49b8-b002-f477039506dc@github.com> On Wed, 2 Nov 2022 09:21:57 GMT, Hao Sun wrote: >> In JDK-8252204, when implemented SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`. >> However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`: >> >> digest_length block_size >> SHA3-224 28 144 >> SHA3-256 32 136 >> SHA3-384 48 104 >> SHA3-512 64 72 >> SHAKE128 variable 168 >> SHAKE256 variable 136 >> >> >> This causes SIGSEGV crash or hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`. >> >> The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances. >> Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed. >> And tier1~3 passed on SHA3 supported hardware. >> >> The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform. >> The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively. >> >> Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform. >> Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression. >> These performance details are available in the comments of the issue page. >> I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform. > > Overall, it looks good to me. (I'm not a Reviewer) > > I agree with your fix to pass `block_size` rather than `digest_length`, but I didn't fully understand your update insides `stubGenerator_aarch64.cpp`, i.e. those operations before `rounds24_loop`, since I'm not a crypto expert. > We may need some crypo expert to review that part. > > Note that I rerun tier1~3 with the latest commit in this PR on sha512 feature supported hardware, and the test passed. > > Thanks. @shqking Thanks for the review. Before the `NR (==24)` iterations in `keccak`, `a0~a24` need to be ready: // SHA3.java private void keccak() { // convert the 200-byte state into 25 lanes bytes2Lanes(state, lanes); ... a0 = lanes[0]; a1 = lanes[1]; ...; a24 = lanes[24]; // process the lanes through step mappings for (int ir = 0; ir < NR; ir++) { long c0 = a0^a5^a10^a15^a20; long c1 = a1^a6^a11^a16^a21; ... While `a_i` is computed as: `state[i] ^= b[ofs++]` in `implCompress0(byte[] b, int ofs)`, `byte2Lanes(state, lanes)`, `a_i = lanes[i]`. 
// SHA3.java @IntrinsicCandidate private void implCompress0(byte[] b, int ofs) { for (int i = 0; i < buffer.length; i++) { state[i] ^= b[ofs++]; } keccak(); } Those operations before `rounds24_loop` can be taken as a vectorized version of the Java code shown above: 1. Load all elements of the `state` array into vector registers (v0~v24). 2. Load the input data into `v25~v31`. How much data is loaded depends on `block_size (== buffer.length)`. 3. Perform `state[i] ^ b[ofs]`. We don't need instructions for `bytes2Lanes`, because it is already done with the `ld1_8B` instructions. Hope this can address your confusion. ------------- PR: https://git.openjdk.org/jdk/pull/10939 From redestad at openjdk.org Thu Nov 3 12:09:34 2022 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 3 Nov 2022 12:09:34 GMT Subject: Integrated: 8296168: x86: Add reasonable constraints between AVX and SSE In-Reply-To: References: Message-ID: <9zjsyFV4QJG7qdH_kHNigcSuNTNpegZAxkV-ml2KSGc=.1854507f-1ac9-456a-b864-d881dc1a4ebc@github.com> On Wed, 2 Nov 2022 12:44:42 GMT, Claes Redestad wrote: > We've not seen any x86 CPU, Intel or otherwise, where AVX features are available but SSE 4.1 is not supported. This patch suggests constraining setup so that any explicit value of UseSSE less than 4 (the default on any AVX-supporting CPU) implicitly disables AVX. This simplifies ergonomics and reduces the testing surface. Concretely this would allow #10847 to not have to guard the new intrinsic on UseSSE level to avoid some surprising test failures in tests verifying SSE-enabled intrinsics. > > I've rearranged the initialization of UseAVX and UseSSE to allow AVX to look at the post-ergo values of UseSSE. > > Testing: tier1-tier3, manual verification This pull request has now been integrated.
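(A minimal sketch of the ergonomic constraint described above; this is not the actual vm_version_x86.cpp change, and the exact message and placement are assumptions:)

    // If the user pins UseSSE below 4, AVX cannot be relied on either,
    // so force it off rather than allowing an untested combination.
    if (UseSSE < 4) {
      if (UseAVX > 0 && !FLAG_IS_DEFAULT(UseAVX)) {
        warning("UseAVX=%d requires SSE4.1; setting UseAVX=0", (int) UseAVX);
      }
      FLAG_SET_DEFAULT(UseAVX, 0);
    }

For a check like this to see the final UseSSE value, UseSSE has to be ergonomically settled first, which is why the initialization order of the two flags was rearranged.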
Changeset: d771abb2 Author: Jie Fu URL: https://git.openjdk.org/jdk/commit/d771abb2fbc72e02faf02f0724aea301953ac5e8 Stats: 27 lines in 1 file changed: 27 ins; 0 del; 0 mod 8295970: Add vector api sanity tests in tier1 Reviewed-by: shade, erikj, kvn ------------- PR: https://git.openjdk.org/jdk/pull/10879 From eosterlund at openjdk.org Thu Nov 3 13:33:09 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Thu, 3 Nov 2022 13:33:09 GMT Subject: RFR: 8296101: nmethod::is_unloading result unstable with concurrent unloading [v2] In-Reply-To: <6-EquT0B-FHlnfP9Gdl78p6otzThgDQpSHJIjgxv2_Q=.8ec85fcd-71a8-4183-bf9f-3cb71b8e3351@github.com> References: <6-EquT0B-FHlnfP9Gdl78p6otzThgDQpSHJIjgxv2_Q=.8ec85fcd-71a8-4183-bf9f-3cb71b8e3351@github.com> Message-ID: On Wed, 2 Nov 2022 11:12:11 GMT, Kim Barrett wrote: >> Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: >> >> Avoid recursion > > Recursion removal looks good. Thanks for the reviews, @kimbarrett and @dean-long . ------------- PR: https://git.openjdk.org/jdk/pull/10926 From eosterlund at openjdk.org Thu Nov 3 13:34:52 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Thu, 3 Nov 2022 13:34:52 GMT Subject: Integrated: 8296101: nmethod::is_unloading result unstable with concurrent unloading In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 09:46:33 GMT, Erik ?sterlund wrote: > If an nmethod is not called during a concurrent full GC, then after marking has terminated, multiple threads can call is_unloading. If at the same time, the nmethod is made not_entrant, then we run into a source of instability in the is_cold() calculation used when computing is_unloading. There we check if the nmethod is_not_entrant(), which some concurrent observers will think is true, while others think it's false. > The current code that sets the is_unloading_state in is_unloading() assumes that the computed state is the same across all observers. However, that is no longer true. > > I propose to set the is_unloading_state with a CAS instead of plain store. Then, as is_unloading() is computed before making nmethods not_entrant, we can guarantee that all concurrent readers of is_unloading in this scenario will return false in the current unloading cycle, instead of racingly returning either false or true. One thread wins, and it will say false, and the other threads will compute conflicting results, but end up agreeing after the CAS, that they should all return false. > > Tested with mach5 tier1-7. Also tried replacing the is_not_entrant() ingredient in is_cold with os::random() to simulate the instability source. Without my fix RunThese crashes almost immediately, and with my fix it doesn't crash. This pull request has now been integrated. 
Changeset: cc3c5a18 Author: Erik ?sterlund URL: https://git.openjdk.org/jdk/commit/cc3c5a18ed4e52ea385ea0e8bedaf1b01f3c5e6e Stats: 16 lines in 1 file changed: 11 ins; 0 del; 5 mod 8296101: nmethod::is_unloading result unstable with concurrent unloading Reviewed-by: kbarrett, dlong ------------- PR: https://git.openjdk.org/jdk/pull/10926 From thartmann at openjdk.org Thu Nov 3 13:40:45 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 3 Nov 2022 13:40:45 GMT Subject: RFR: 8296285: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java fails after JDK-8280378 In-Reply-To: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> References: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> Message-ID: On Thu, 3 Nov 2022 07:39:05 GMT, Fei Yang wrote: > Hi, > Please review this trivial change fixing a typo in jtreg test: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java > This typo is introduced by JDK-8280378 and the test case fails to compile by javac when running on RISC-V. > Testing: Test case passed on RISC-V after this typo is fixed. Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10965 From chagedorn at openjdk.org Thu Nov 3 13:40:45 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 3 Nov 2022 13:40:45 GMT Subject: RFR: 8296285: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java fails after JDK-8280378 In-Reply-To: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> References: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> Message-ID: On Thu, 3 Nov 2022 07:39:05 GMT, Fei Yang wrote: > Hi, > Please review this trivial change fixing a typo in jtreg test: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java > This typo is introduced by JDK-8280378 and the test case fails to compile by javac when running on RISC-V. > Testing: Test case passed on RISC-V after this typo is fixed. Thanks for fixing this! I missed that typo as I could not run this test on RISC-V. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10965 From mdoerr at openjdk.org Thu Nov 3 16:31:30 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Nov 2022 16:31:30 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v2] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 22:28:40 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add assertions to ensure no immediate oops in MH intrinsics. Thanks for all your interesting thoughts. But now, I think we're discussing too many things in one PR. I suggest the following: - New issue for the Loom intrinsics. We just need 2 native functions, so they could just get created at startup if there's a concern. - New issue for C2I investigation. We may need to generate adapters in the NonNMethod space. Isn't that a similar problem? 
- Discuss further if this PR is a viable solution for the method handle intrinsic issue in JDK 20. They get created per signature, so it's hard to generate all ones :-) I wouldn't expect an insane amount, so a bit of NonNMethod space can hopefully take as much as needed. If there are more concerns, we could still wait for GC. - I'd like to try implementing a reproduction case, but I can't promise if I can get that done before RDP1. There's a lot to do atm. Could we add it later if it takes too long? ------------- PR: https://git.openjdk.org/jdk/pull/10933 From eastigeevich at openjdk.org Thu Nov 3 16:50:44 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 3 Nov 2022 16:50:44 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics [v2] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 00:19:03 GMT, Yi-Fan Tsai wrote: >> Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. >> >> Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. > > Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision: > > Set WARMUP_ITERATIONS based on Tier4InvocationThreshold Marked as reviewed by eastigeevich (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/10954 From dlong at openjdk.org Thu Nov 3 19:20:26 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Nov 2022 19:20:26 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v2] In-Reply-To: References: Message-ID: <3f_GjtoKHboKAZT3T9Zxi1OdatF7OykJvb4o5K4wWyA=.c9dcc0d6-5c8d-4a46-8a7d-63f65e83655b@github.com> On Wed, 2 Nov 2022 22:28:40 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add assertions to ensure no immediate oops in MH intrinsics. Those are reasonable suggestions. Sounds like a plan. For this PR, could you refactor the oops checking logic into its own method? Then would it make sense to add checks for relocations that use metadata? Some types of call sites are resolved lazily, so they could be using metadata without that being reflected in the oops. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From phh at openjdk.org Thu Nov 3 19:26:07 2022 From: phh at openjdk.org (Paul Hohensee) Date: Thu, 3 Nov 2022 19:26:07 GMT Subject: RFR: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics [v2] In-Reply-To: References: Message-ID: <3eycHer_PGXZ14W9KvxSXQWbP9kfouY-N4kBmhBGz_8=.38401618-8eb4-4b55-835a-dab1bf74c302@github.com> On Thu, 3 Nov 2022 00:19:03 GMT, Yi-Fan Tsai wrote: >> Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. >> >> Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. 
> > Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision: > > Set WARMUP_ITERATIONS based on Tier4InvocationThreshold Lgtm. ------------- Marked as reviewed by phh (Reviewer). PR: https://git.openjdk.org/jdk/pull/10954 From duke at openjdk.org Thu Nov 3 19:29:09 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Thu, 3 Nov 2022 19:29:09 GMT Subject: Integrated: 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 17:30:19 GMT, Yi-Fan Tsai wrote: > Increase WARMUP_ITERATIONS to get MD5 hash from the intrinsic. > > Both TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are run with options "-XX:CompileThreshold=500 -XX:Tier4InvocationThreshold=500", so the current value is too low for the intrinsic to be executed. This pull request has now been integrated. Changeset: f43bb9fe Author: Yi-Fan Tsai Committer: Paul Hohensee URL: https://git.openjdk.org/jdk/commit/f43bb9feaa03008bad9708a4d7ed850d2532e102 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8296190: TestMD5Intrinsics and TestMD5MultiBlockIntrinsics don't test the intrinsics Reviewed-by: eastigeevich, phh, simonis ------------- PR: https://git.openjdk.org/jdk/pull/10954 From mdoerr at openjdk.org Thu Nov 3 21:15:39 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Nov 2022 21:15:39 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v3] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Refactor oop checks. Add metadata check. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/885c51c0..0ef491fc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=01-02 Stats: 23 lines in 2 files changed: 15 ins; 6 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Thu Nov 3 21:40:54 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Nov 2022 21:40:54 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v3] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 21:15:39 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Refactor oop checks. Add metadata check. Filed 2 issues: https://bugs.openjdk.org/browse/JDK-8296336 https://bugs.openjdk.org/browse/JDK-8296334 Feel free to modify them. I've added a check for metadata. 
I don't think we need to check for metadata relocations because the oop recorder always records all metadata pointers. We never iterate over the code to find metadata AFAIK (except for patching purposes, but the metadata is duplicated in this case, not in the code only). Please take a look. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From iklam at openjdk.org Thu Nov 3 21:53:27 2022 From: iklam at openjdk.org (Ioi Lam) Date: Thu, 3 Nov 2022 21:53:27 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v3] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 21:15:39 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Refactor oop checks. Add metadata check. Is it possible to have a backup stub? It can trap to c code, do what the method handle intrinsics is supposed to do (by iterating on the signature), and continue execution. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From dlong at openjdk.org Thu Nov 3 22:11:30 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Nov 2022 22:11:30 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v3] In-Reply-To: References: Message-ID: <-b9cMAzyXX2w1SZPG6k9SlEu1DXN5cPuNSLn_DZJ2Zs=.4742dfd3-e212-4e00-a328-34b182a0c690@github.com> On Thu, 3 Nov 2022 21:37:16 GMT, Martin Doerr wrote: > Filed 2 issues: https://bugs.openjdk.org/browse/JDK-8296336 https://bugs.openjdk.org/browse/JDK-8296334 Feel free to modify them. > > I've added a check for metadata. I don't think we need to check for metadata relocations because the oop recorder always records all metadata pointers. We never iterate over the code to find metadata AFAIK (except for patching purposes, but the metadata is duplicated in this case, not in the code only). Please take a look. The metadata in stubs might not have an oop. See the comment here: https://github.com/openjdk/jdk/blob/054c23f484522881a0879176383d970a8de41201/src/hotspot/share/code/compiledMethod.cpp#L632 ------------- PR: https://git.openjdk.org/jdk/pull/10933 From dlong at openjdk.org Thu Nov 3 22:15:31 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Nov 2022 22:15:31 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v3] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 21:15:39 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Refactor oop checks. Add metadata check. src/hotspot/share/code/nmethod.cpp line 493: > 491: assert(nm->metadata_size() == 0, "metadata usage not expected"); > 492: } > 493: #endif It might make sense to do this check for all nmethods that were allocated in NonNMethod. 
------------- PR: https://git.openjdk.org/jdk/pull/10933 From dlong at openjdk.org Thu Nov 3 22:21:26 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Nov 2022 22:21:26 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v3] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 21:51:02 GMT, Ioi Lam wrote: > Is it possible to have a backup stub? It can trap to c code, do what the method handle intrinsics is supposed to do (by iterating on the signature), and continue execution. Yes, I was thinking the interpreter _linkToNative might need to do that, similar to how old versions of the JDK had a generic native method wrapper that could handle all signatures. That can be investigated as part of JDK-8296336. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From fyang at openjdk.org Fri Nov 4 00:57:26 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 4 Nov 2022 00:57:26 GMT Subject: RFR: 8296285: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java fails after JDK-8280378 In-Reply-To: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> References: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> Message-ID: On Thu, 3 Nov 2022 07:39:05 GMT, Fei Yang wrote: > Hi, > Please review this trivial change fixing a typo in jtreg test: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java > This typo is introduced by JDK-8280378 and the test case fails to compile by javac when running on RISC-V. > Testing: Test case passed on RISC-V after this typo is fixed. Thank you! ------------- PR: https://git.openjdk.org/jdk/pull/10965 From fyang at openjdk.org Fri Nov 4 00:59:32 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 4 Nov 2022 00:59:32 GMT Subject: Integrated: 8296285: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java fails after JDK-8280378 In-Reply-To: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> References: <_Gl3B6Bsp3sgAUoXdtPJ9dCZ7SZKn-QT2p7_d0fqncY=.9d651037-db75-43c1-b73a-c00724794f88@github.com> Message-ID: On Thu, 3 Nov 2022 07:39:05 GMT, Fei Yang wrote: > Hi, > Please review this trivial change fixing a typo in jtreg test: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java > This typo is introduced by JDK-8280378 and the test case fails to compile by javac when running on RISC-V. > Testing: Test case passed on RISC-V after this typo is fixed. This pull request has now been integrated. Changeset: 4d1bc1b5 Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/4d1bc1b5add61f443f99f6d0726ebf8e37dc14ab Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8296285: test/hotspot/jtreg/compiler/intrinsics/TestFloatIsFinite.java fails after JDK-8280378 Reviewed-by: jiefu, thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10965 From haosun at openjdk.org Fri Nov 4 01:34:31 2022 From: haosun at openjdk.org (Hao Sun) Date: Fri, 4 Nov 2022 01:34:31 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 03:06:21 GMT, Dong Bo wrote: > In JDK-8252204, when implemented SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`. 
> However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`: > > digest_length block_size > SHA3-224 28 144 > SHA3-256 32 136 > SHA3-384 48 104 > SHA3-512 64 72 > SHAKE128 variable 168 > SHAKE256 variable 136 > > > This causes SIGSEGV crash or hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`. > > The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances. > Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed. > And tier1~3 passed on SHA3 supported hardware. > > The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform. > The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively. > > Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform. > Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression. > These performance details are available in the comments of the issue page. > I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform. Thanks for your kind explanation. Understood. This patch is good to me. (I'm not a Reviewer) ------------- PR: https://git.openjdk.org/jdk/pull/10939 From yadongwang at openjdk.org Fri Nov 4 01:50:26 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 4 Nov 2022 01:50:26 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:54:47 GMT, Gui Cao wrote: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
> > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu lgtm ------------- Marked as reviewed by yadongwang (Author). PR: https://git.openjdk.org/jdk/pull/10691 From yadongwang at openjdk.org Fri Nov 4 02:27:35 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 4 Nov 2022 02:27:35 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:54:47 GMT, Gui Cao wrote: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. > > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ 
vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu src/hotspot/cpu/riscv/riscv_v.ad line 814: > 812: "vmv.x.s $dst, $tmp" %} > 813: ins_encode %{ > 814: __ vsetvli(t0, x0, Assembler::e64); Only the element basic type of the two code segments is different. Could you use Matcher::vector_element_basic_type() to simplify the code? ------------- PR: https://git.openjdk.org/jdk/pull/10691 From kbarrett at openjdk.org Fri Nov 4 03:08:13 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 4 Nov 2022 03:08:13 GMT Subject: RFR: 8296349: [aarch64] Avoid slicing Address::extend Message-ID: Please review this change around `Address::extend`. The 4 derived classes are replaced by static functions of the same name as the former class. These functions return an `extend` object initialized with the same values as were used by the corresponding derived class constructor. 
Testing: mach5 tier1-3 ------------- Commit messages: - flatten Address::extend Changes: https://git.openjdk.org/jdk/pull/10976/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10976&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296349 Stats: 20 lines in 2 files changed: 0 ins; 14 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10976.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10976/head:pull/10976 PR: https://git.openjdk.org/jdk/pull/10976 From duke at openjdk.org Fri Nov 4 03:20:11 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 4 Nov 2022 03:20:11 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: - Merge remote-tracking branch 'origin/master' into avx512-poly - address Jamil's review - invalidkeyexception and some review comments - extra whitespace character - assembler checks and test case fixes - Merge remote-tracking branch 'origin/master' into avx512-poly - Merge remote-tracking branch 'origin' into avx512-poly - further restrict UsePolyIntrinsics with supports_avx512vlbw - missed white-space fix - - Fix whitespace and copyright statements - Add benchmark - ... 
and 2 more: https://git.openjdk.org/jdk/compare/9d3b4ef2...38d9e83c ------------- Changes: https://git.openjdk.org/jdk/pull/10582/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=06 Stats: 1852 lines in 32 files changed: 1815 ins; 3 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 4 03:24:33 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 4 Nov 2022 03:24:33 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: <0xJMPRdK0h3UJBYxqeLMfp1baL8xoaUpNcAZOtrFLKo=.d5c1020e-9e61-4800-bb52-9adbdd17e19f@github.com> References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> <0xJMPRdK0h3UJBYxqeLMfp1baL8xoaUpNcAZOtrFLKo=.d5c1020e-9e61-4800-bb52-9adbdd17e19f@github.com> Message-ID: On Fri, 28 Oct 2022 21:55:59 GMT, Jamil Nimeh wrote: >> I flipped-flopped on this.. I already had the code for the exception.. and already described the potential fix. So rather then remove the code, pushed the described fix. Its always easier to remove the extra field I added. Let me know what you think about the 'backdoor' field. > > Well, what you're doing achieves what we're looking for, thanks for making that change. I think I'd like to see that value set on construction and not be mutable from outside the object. Something like this: > > - place a `private final boolean checkWeakKey` up near where all the other fields are defined. > - the no-args Poly1305 is implemented as `this(true)` > - an additional constructor is created `Poly1305(boolean checkKey)` which sets `checkWeakKey` true or false as provided by the parameter. > - in setRSVals you should be able to wrap lines 296-310 inside a single `if (checkWeakKey)` block. > - In the Poly1305KAT the `new Poly1305()` becomes `new Poly1305(false)`. done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 4 03:54:13 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 4 Nov 2022 03:54:13 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7] In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 03:20:11 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 
1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Merge remote-tracking branch 'origin/master' into avx512-poly > - address Jamil's review > - invalidkeyexception and some review comments > - extra whitespace character > - assembler checks and test case fixes > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Merge remote-tracking branch 'origin' into avx512-poly > - further restrict UsePolyIntrinsics with supports_avx512vlbw > - missed white-space fix > - - Fix whitespace and copyright statements > - Add benchmark > - ... and 2 more: https://git.openjdk.org/jdk/compare/9d3b4ef2...38d9e83c @jnimeh Hopefully last change addresses your pending comments. More data, new data... datasize | master | optimized | disabled | opt/mst | dis/mst -- | -- | -- | -- | -- | -- 32 | 3218169 | 3476352 | 3126538 | 1.08 | 0.97 64 | 2858030 | 3391015 | 2846735 | 1.19 | 1.00 128 | 2396796 | 3239888 | 2406931 | 1.35 | 1.00 256 | 1780679 | 3063749 | 1765664 | 1.72 | 0.99 512 | 1168824 | 2918524 | 1153009 | 2.50 | 0.99 1024 | 648772.1 | 2716787 | 688467.7 | 4.19 | 1.06 2048 | 357009 | 2382723 | 376023.7 | 6.67 | 1.05 16384 | 48854.33 | 896850 | 53104.68 | 18.36 | 1.09 1048576 | 771.461 | 15088.63 | 846.247 | 19.56 | 1.10 src/hotspot/share/opto/library_call.cpp line 7016: > 7014: Node* rObj = new CheckCastPPNode(control(), rFace, rtype); > 7015: rObj = _gvn.transform(rObj); > 7016: Node* rlimbs = load_field_from_object(rObj, "limbs", "[J"); @jnimeh if you could be particularly 'critical' here please? I generally know what I wanted to accomplish. And stepped through things with a debugger... but all the various IR types and conversions, I just don't know. I copied things from AES, which seem to work, as they do here, but I don't _understand_ the code. i.e. recursive `getfield`s `((IntegerPolynomial$MutableElement)(this.a)).limbs` plus checks if we know field offsets: if (recursive) classes are loaded. But if not loaded, crashing with assert? Seems 'rude'. I think Poly1305 class constructor running would had forced the classes here to load so nothing to worry about, so I suppose assert is enough.) src/hotspot/share/opto/library_call.cpp line 7027: > 7025: // Node* cmp = _gvn.transform(new CmpINode(load_array_length(alimbs), intcon(5))); > 7026: // Node* bol = _gvn.transform(new BoolNode(cmp, BoolTest::eq)); > 7027: // Node* if_eq = generate_slow_guard(bol, slow_region); @jnimeh I had "valiantly" tried to do a length check here, but couldn't find where to steal code from! If you have some suggestions... Meanwhile, I decided that perhaps a Java check would not be _that_ bad for non-intrinsic code. See `checkLimbsForIntrinsic`; I had to change the interface `IntegerModuloP` which initially felt like a hack. But perhaps the java check is 'alright', reminds java developer that there is a related intrinsic. 
test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/java.base/com/sun/crypto/provider/Poly1305IntrinsicFuzzTest.java line 39: > 37: public static void main(String[] args) throws Exception { > 38: //Note: it might be useful to increase this number during development of new Poly1305 intrinsics > 39: final int repeat = 100; @jnimeh FYI... In case you end up doing supporting other architectures, left a trail (and lots of 'math' comments in the assembler) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From gcao at openjdk.org Fri Nov 4 06:23:30 2022 From: gcao at openjdk.org (Gui Cao) Date: Fri, 4 Nov 2022 06:23:30 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 02:23:30 GMT, Yadong Wang wrote: >> Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. >> >> For example, AndReductionV is implemented as follows: >> >> >> diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad >> index 0ef36fdb292..c04962993c0 100644 >> --- a/src/hotspot/cpu/riscv/riscv_v.ad >> +++ b/src/hotspot/cpu/riscv/riscv_v.ad >> @@ -63,7 +63,6 @@ source %{ >> case Op_ExtractS: >> case Op_ExtractUB: >> // Vector API specific >> - case Op_AndReductionV: >> case Op_OrReductionV: >> case Op_XorReductionV: >> case Op_LoadVectorGather: >> @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ >> ins_pipe(pipe_slow); >> %} >> >> +// vector and reduction >> + >> +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ >> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); >> + match(Set dst (AndReductionV src1 src2)); >> + effect(TEMP tmp); >> + ins_cost(VEC_COST); >> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" >> + "vredand.vs $tmp, $src2, $tmp\n\t" >> + "vmv.x.s $dst, $tmp" %} >> + ins_encode %{ >> + __ vsetvli(t0, x0, Assembler::e32); >> + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); >> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> + as_VectorRegister($tmp$$reg)); >> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); >> + %} >> + ins_pipe(pipe_slow); >> +%} >> + >> +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ >> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); >> + match(Set dst (AndReductionV src1 src2)); >> + effect(TEMP tmp); >> + ins_cost(VEC_COST); >> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" >> + "vredand.vs $tmp, $src2, $tmp\n\t" >> + "vmv.x.s $dst, $tmp" %} >> + ins_encode %{ >> + __ vsetvli(t0, x0, Assembler::e64); >> + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); >> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> + as_VectorRegister($tmp$$reg)); >> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); >> + %} >> >> >> >> After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. 
>> >> By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: >> >> >> 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 >> 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass >> 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null >> 2b8 ld R30, [R14, #40] # class, #@loadKlass >> 2bc li R7, #-1 # int, #@loadConI >> 2c0 vmv.s.x V1, R7 #@reduce_andI >> vredand.vs V1, V2, V1 >> vmv.x.s R28, V1 >> 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP >> 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 >> >> >> There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests >> [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests >> [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md >> >> ## Testing: >> - hotspot and jdk tier1 on unmatched board without new failures >> - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu >> - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu > > src/hotspot/cpu/riscv/riscv_v.ad line 814: > >> 812: "vmv.x.s $dst, $tmp" %} >> 813: ins_encode %{ >> 814: __ vsetvli(t0, x0, Assembler::e64); > > Only the element basic type of the two code segments is different. Could you use Matcher::vector_element_basic_type() to simplify the code? @yadongw Hello, thanks for review. the current definition of AndReductionV node of riscv refers to the AndReductionV node of aarch64 and the AddReductionVI, AddReductionVL of riscv. The parameter types of the nodes here are different. At present, Matcher::vector_element_basic_type() should not be used to simplify the code. 
For example, the AndReductionV node of aarch64 defines the parameter types as follows: instruct reduce_andI_sve(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vRegD tmp) instruct reduce_andL_sve(iRegLNoSp dst, iRegL isrc, vReg vsrc, vRegD tmp) riscv's AddReductionVI, AddReductionVL node defines the parameter types as follows: instruct reduce_addI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) instruct reduce_addL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) ------------- PR: https://git.openjdk.org/jdk/pull/10691 From yadongwang at openjdk.org Fri Nov 4 08:38:30 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 4 Nov 2022 08:38:30 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 06:21:17 GMT, Gui Cao wrote: >> src/hotspot/cpu/riscv/riscv_v.ad line 814: >> >>> 812: "vmv.x.s $dst, $tmp" %} >>> 813: ins_encode %{ >>> 814: __ vsetvli(t0, x0, Assembler::e64); >> >> Only the element basic type of the two code segments is different. Could you use Matcher::vector_element_basic_type() to simplify the code? > > @yadongw Hello, thanks for review. the current definition of AndReductionV node of riscv refers to the AndReductionV node of aarch64 and the AddReductionVI, AddReductionVL of riscv. The parameter types of the nodes here are different. At present, Matcher::vector_element_basic_type() should not be used to simplify the code. > > For example, the AndReductionV node of aarch64 defines the parameter types as follows: > > instruct reduce_andI_sve(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vRegD tmp) > > instruct reduce_andL_sve(iRegLNoSp dst, iRegL isrc, vReg vsrc, vRegD tmp) > > > riscv's AddReductionVI, AddReductionVL node defines the parameter types as follows: > > instruct reduce_addI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) > > instruct reduce_addL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) Okay, that's fine. ------------- PR: https://git.openjdk.org/jdk/pull/10691 From aph at openjdk.org Fri Nov 4 08:47:31 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 4 Nov 2022 08:47:31 GMT Subject: RFR: 8296349: [aarch64] Avoid slicing Address::extend In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 03:00:54 GMT, Kim Barrett wrote: > Please review this change around `Address::extend`. The 4 derived classes are > replaced by static functions of the same name as the former class. These > functions return an `extend` object initialized with the same values as were > used by the corresponding derived class constructor. > > Testing: mach5 tier1-3 Yep. That's a nice simplification, thanks. ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.org/jdk/pull/10976 From dzhang at openjdk.org Fri Nov 4 09:02:08 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 4 Nov 2022 09:02:08 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:54:47 GMT, Gui Cao wrote: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
> > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu LGTM, thanks! ------------- Marked as reviewed by dzhang (Author). PR: https://git.openjdk.org/jdk/pull/10691 From fyang at openjdk.org Fri Nov 4 09:16:01 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 4 Nov 2022 09:16:01 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:54:47 GMT, Gui Cao wrote: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. > > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ 
vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu Marked as reviewed by fyang (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10691 From eliu at openjdk.org Fri Nov 4 09:16:04 2022 From: eliu at openjdk.org (Eric Liu) Date: Fri, 4 Nov 2022 09:16:04 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:54:47 GMT, Gui Cao wrote: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
> > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu src/hotspot/cpu/riscv/riscv_v.ad line 788: > 786: > 787: instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > 788: predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); Does `Matcher::vector_element_basic_type(n->in(2)) == T_INT` work here? src/hotspot/cpu/riscv/riscv_v.ad line 838: > 836: __ vredor_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > 837: as_VectorRegister($tmp$$reg)); > 838: __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); This is basically a shared code pattern for OrReductionV, AndReductionV, XorReduction. Maybe a common method can help to simplify the code. ------------- PR: https://git.openjdk.org/jdk/pull/10691 From tholenstein at openjdk.org Fri Nov 4 09:31:06 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 4 Nov 2022 09:31:06 GMT Subject: RFR: JDK-8296380: IGV: Shortcut for quick search not working Message-ID: The shortcut `Ctrl`-`F` for quick search was not working ------------- Commit messages: - JDK-8296380: IGV: Shortcut for quick search not working Changes: https://git.openjdk.org/jdk/pull/10980/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10980&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296380 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10980.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10980/head:pull/10980 PR: https://git.openjdk.org/jdk/pull/10980 From chagedorn at openjdk.org Fri Nov 4 09:31:07 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 4 Nov 2022 09:31:07 GMT Subject: RFR: JDK-8296380: IGV: Shortcut for quick search not working In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 09:22:11 GMT, Tobias Holenstein wrote: > The shortcut `Ctrl`-`F` for quick search was not working Looks good and trivial! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10980 From tholenstein at openjdk.org Fri Nov 4 09:35:06 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 4 Nov 2022 09:35:06 GMT Subject: RFR: JDK-8296380: IGV: Shortcut for quick search not working In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 09:25:07 GMT, Christian Hagedorn wrote: >> The shortcut `Ctrl`-`F` for quick search was not working > > Looks good and trivial! thanks @chhagedorn ! 
------------- PR: https://git.openjdk.org/jdk/pull/10980 From rrich at openjdk.org Fri Nov 4 09:36:04 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 4 Nov 2022 09:36:04 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v2] In-Reply-To: References: Message-ID: > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows running with `-XX:+VMContinuations`, which is a prerequisite for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added to or subtracted from a frame address or size. > > The following is supposed to explain (without real-life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. > > > X86 / AARCH64 PPC64: > > : : : : > : : : : > | | | | > |-----------------| |-----------------| > | | | | > | stack arguments | | stack arguments | > | |<- callers_SP | | > =================== |-----------------| > | | | | > | metadata at bottom | | metadata at top | > | | | |<- callers_SP > |-----------------| =================== > | | | | > | | | | > | | | | > | | | | > | |<- SP | | > =================== |-----------------| > | | > | metadata at top | > | |<- SP > =================== > > > On X86 and AARCH64, metadata (that's the return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`), where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. > > * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: > `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` > > * address of stack arguments: > `callers_SP + frame::metadata_words_at_top` > > * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. > > Please refer to the comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's more of an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might also serve as an intro to the matter. > > The PR includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as an argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames, which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know whether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as I found it very useful; it increases the runtime quite a bit, though. > > Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. 
These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. > > Thanks, Richard. Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Use callers_sp for fsize calculation in recurse_freeze_interpreted_frame ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10961/files - new: https://git.openjdk.org/jdk/pull/10961/files/c1564a2a..0d12b057 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=00-01 Stats: 34 lines in 9 files changed: 32 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From tholenstein at openjdk.org Fri Nov 4 09:36:48 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 4 Nov 2022 09:36:48 GMT Subject: Integrated: JDK-8296380: IGV: Shortcut for quick search not working In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 09:22:11 GMT, Tobias Holenstein wrote: > The shortcut `Ctrl`-`F` for quick search was not working This pull request has now been integrated. Changeset: bd729e69 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/bd729e69066b94593b7a775c0034c5e8537b73cc Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8296380: IGV: Shortcut for quick search not working Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/10980 From gcao at openjdk.org Fri Nov 4 09:37:48 2022 From: gcao at openjdk.org (Gui Cao) Date: Fri, 4 Nov 2022 09:37:48 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: <5swDwbPuht7SuThvQWcs7_bP2Pjiw94W1qo2v2yCxBs=.c2f36e75-fdf8-47bb-99ef-93eda85784a2@github.com> On Fri, 4 Nov 2022 09:03:12 GMT, Eric Liu wrote: >> Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
>> >> For example, AndReductionV is implemented as follows: >> >> >> diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad >> index 0ef36fdb292..c04962993c0 100644 >> --- a/src/hotspot/cpu/riscv/riscv_v.ad >> +++ b/src/hotspot/cpu/riscv/riscv_v.ad >> @@ -63,7 +63,6 @@ source %{ >> case Op_ExtractS: >> case Op_ExtractUB: >> // Vector API specific >> - case Op_AndReductionV: >> case Op_OrReductionV: >> case Op_XorReductionV: >> case Op_LoadVectorGather: >> @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ >> ins_pipe(pipe_slow); >> %} >> >> +// vector and reduction >> + >> +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ >> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); >> + match(Set dst (AndReductionV src1 src2)); >> + effect(TEMP tmp); >> + ins_cost(VEC_COST); >> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" >> + "vredand.vs $tmp, $src2, $tmp\n\t" >> + "vmv.x.s $dst, $tmp" %} >> + ins_encode %{ >> + __ vsetvli(t0, x0, Assembler::e32); >> + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); >> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> + as_VectorRegister($tmp$$reg)); >> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); >> + %} >> + ins_pipe(pipe_slow); >> +%} >> + >> +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ >> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); >> + match(Set dst (AndReductionV src1 src2)); >> + effect(TEMP tmp); >> + ins_cost(VEC_COST); >> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" >> + "vredand.vs $tmp, $src2, $tmp\n\t" >> + "vmv.x.s $dst, $tmp" %} >> + ins_encode %{ >> + __ vsetvli(t0, x0, Assembler::e64); >> + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); >> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> + as_VectorRegister($tmp$$reg)); >> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); >> + %} >> >> >> >> After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. >> >> By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: >> >> >> 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 >> 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass >> 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null >> 2b8 ld R30, [R14, #40] # class, #@loadKlass >> 2bc li R7, #-1 # int, #@loadConI >> 2c0 vmv.s.x V1, R7 #@reduce_andI >> vredand.vs V1, V2, V1 >> vmv.x.s R28, V1 >> 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP >> 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 >> >> >> There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests >> [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests >> [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md >> >> ## Testing: >> - hotspot and jdk tier1 on unmatched board without new failures >> - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu >> - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu > > src/hotspot/cpu/riscv/riscv_v.ad line 788: > >> 786: >> 787: instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ >> 788: predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > > Does `Matcher::vector_element_basic_type(n->in(2)) == T_INT` work here? Thanks for the review, this encoding method does look a lot clearer, but there are many codes in riscv_ad that use the current writing method. Next, I will test this new encoding method. If everyone agrees, then Make unified changes. ------------- PR: https://git.openjdk.org/jdk/pull/10691 From gcao at openjdk.org Fri Nov 4 10:05:20 2022 From: gcao at openjdk.org (Gui Cao) Date: Fri, 4 Nov 2022 10:05:20 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 09:08:38 GMT, Eric Liu wrote: >> Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
>> >> For example, AndReductionV is implemented as follows: >> >> >> diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad >> index 0ef36fdb292..c04962993c0 100644 >> --- a/src/hotspot/cpu/riscv/riscv_v.ad >> +++ b/src/hotspot/cpu/riscv/riscv_v.ad >> @@ -63,7 +63,6 @@ source %{ >> case Op_ExtractS: >> case Op_ExtractUB: >> // Vector API specific >> - case Op_AndReductionV: >> case Op_OrReductionV: >> case Op_XorReductionV: >> case Op_LoadVectorGather: >> @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ >> ins_pipe(pipe_slow); >> %} >> >> +// vector and reduction >> + >> +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ >> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); >> + match(Set dst (AndReductionV src1 src2)); >> + effect(TEMP tmp); >> + ins_cost(VEC_COST); >> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" >> + "vredand.vs $tmp, $src2, $tmp\n\t" >> + "vmv.x.s $dst, $tmp" %} >> + ins_encode %{ >> + __ vsetvli(t0, x0, Assembler::e32); >> + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); >> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> + as_VectorRegister($tmp$$reg)); >> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); >> + %} >> + ins_pipe(pipe_slow); >> +%} >> + >> +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ >> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); >> + match(Set dst (AndReductionV src1 src2)); >> + effect(TEMP tmp); >> + ins_cost(VEC_COST); >> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" >> + "vredand.vs $tmp, $src2, $tmp\n\t" >> + "vmv.x.s $dst, $tmp" %} >> + ins_encode %{ >> + __ vsetvli(t0, x0, Assembler::e64); >> + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); >> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> + as_VectorRegister($tmp$$reg)); >> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); >> + %} >> >> >> >> After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. >> >> By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: >> >> >> 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 >> 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass >> 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null >> 2b8 ld R30, [R14, #40] # class, #@loadKlass >> 2bc li R7, #-1 # int, #@loadConI >> 2c0 vmv.s.x V1, R7 #@reduce_andI >> vredand.vs V1, V2, V1 >> vmv.x.s R28, V1 >> 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP >> 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 >> >> >> There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests >> [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests >> [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md >> >> ## Testing: >> - hotspot and jdk tier1 on unmatched board without new failures >> - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu >> - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu > > src/hotspot/cpu/riscv/riscv_v.ad line 838: > >> 836: __ vredor_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> 837: as_VectorRegister($tmp$$reg)); >> 838: __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > > This is basically a shared code pattern for OrReductionV, AndReductionV, XorReduction. Maybe a common method can help to simplify the code. @TheShermanTanker I get it. I think that AddReductionVI and AddReductionVL can also be simplified as above. I will submit a new PR after testing, and provide a general simplified method in it. ------------- PR: https://git.openjdk.org/jdk/pull/10691 From mdoerr at openjdk.org Fri Nov 4 13:12:35 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 4 Nov 2022 13:12:35 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v3] In-Reply-To: <-b9cMAzyXX2w1SZPG6k9SlEu1DXN5cPuNSLn_DZJ2Zs=.4742dfd3-e212-4e00-a328-34b182a0c690@github.com> References: <-b9cMAzyXX2w1SZPG6k9SlEu1DXN5cPuNSLn_DZJ2Zs=.4742dfd3-e212-4e00-a328-34b182a0c690@github.com> Message-ID: On Thu, 3 Nov 2022 22:09:21 GMT, Dean Long wrote: > > Filed 2 issues: https://bugs.openjdk.org/browse/JDK-8296336 https://bugs.openjdk.org/browse/JDK-8296334 Feel free to modify them. > > I've added a check for metadata. I don't think we need to check for metadata relocations because the oop recorder always records all metadata pointers. We never iterate over the code to find metadata AFAIK (except for patching purposes, but the metadata is duplicated in this case, not in the code only). Please take a look. > > The metadata in stubs might not have an oop. See the comment here: > > https://github.com/openjdk/jdk/blob/054c23f484522881a0879176383d970a8de41201/src/hotspot/share/code/compiledMethod.cpp#L632 I think this code was written for x86 and is based on a wrong assumption. If a static call (or optimized virtual call) is reachable then the callee class can't be dead. Unfortunately, JDK-8222841 didn't include a test. I've removed the `CompiledMethod::cleanup_inline_caches_impl` change and ran a few tests like `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java`. Didn't show errors. Other platforms (e.g. PPC64, AArch64) do allocate a metadata slot via oop recorder. The code you are referring to only clears the slot in the metadata section, not the replicated value which is actually used by the code. Note that `fix_metadata_relocation()` does nothing on most platforms! This is a further indication that the code is not needed. Should I file another RFE for that? 
------------- PR: https://git.openjdk.org/jdk/pull/10933 From thartmann at openjdk.org Fri Nov 4 13:25:37 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 4 Nov 2022 13:25:37 GMT Subject: RFR: 8294816: C2: Math.min/max vectorization miscompilation [v2] In-Reply-To: References: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> Message-ID: On Wed, 2 Nov 2022 14:42:20 GMT, Bhavana Kilambi wrote: >> C2 miscompiles during auto-vectorization of MinI/MaxI nodes when "short" type operands are involved. When a short and an integer value is compared, C2 generates vector min/max nodes for "short" types which does not result in correct output as it disregards the higher order bits of the integer input. Java API for Math.min/max also only supports the int, long, float and double types but not the subword integer types namely - char, byte and short. Hence, char/short/byte min/max vector instructions should not be generated. >> This patch ensures that MaxV and MinV vector nodes are only generated for the "int" type for MaxI and MinI nodes during auto-vectorization. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Updated string names for Min and Max IR nodes All tests passed. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10944 From jnimeh at openjdk.org Fri Nov 4 14:40:45 2022 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Fri, 4 Nov 2022 14:40:45 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7] In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 03:20:11 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 12 commits: > > - Merge remote-tracking branch 'origin/master' into avx512-poly > - address Jamil's review > - invalidkeyexception and some review comments > - extra whitespace character > - assembler checks and test case fixes > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Merge remote-tracking branch 'origin' into avx512-poly > - further restrict UsePolyIntrinsics with supports_avx512vlbw > - missed white-space fix > - - Fix whitespace and copyright statements > - Add benchmark > - ... and 2 more: https://git.openjdk.org/jdk/compare/9d3b4ef2...38d9e83c Regarding the updated numbers and master v. optimized-and-disabled, those are looking pretty good. Looks like the break-even point is at 64 bytes and gets better from there which I think addresses my concerns. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 4 14:40:45 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 4 Nov 2022 14:40:45 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> <4HxTb1DtD6KeuYupOKf32GoQ7SV8_EjHcqfhiZhbLHM=.884e631a-1336-454d-aae1-06f85f784381@github.com> Message-ID: <2_5CBS8aY7mUfXAvavQiM66Xg7ZNmjDiM6YM1vADCM4=.c0b0cc5e-9257-4d65-82cc-1cfd5523b554@github.com> On Wed, 2 Nov 2022 03:16:57 GMT, Jatin Bhateja wrote: >>> And just looking now on uops.info, they seem to have identical timings? >> >> Actual instruction being used (aligned vs unaligned versions) doesn't matter much here, because it's a dynamic property of the address being accessed: misaligned accesses that cross cache line boundary incur a penalty. Since cache lines are 64 bytes in size, every misaligned 512-bit access is penalized. > > I collected performance counters for the benchmark included with the patch and its showing around 30% of 64 byte loads were spanning across the cache line. > > Performance counter stats for 'java -jar target/benchmarks.jar -f 1 -wi 1 -i 2 -w 30 -p dataSize=8192': > > 122385646614 cycles > 328096538160 instructions # 2.68 insn per cycle > 64530343063 MEM_INST_RETIRED.ALL_LOADS > 22900705491 MEM_INST_RETIRED.ALL_STORES > 19815558484 MEM_INST_RETIRED.SPLIT_LOADS > 701176106 MEM_INST_RETIRED.SPLIT_STORES > > Presence of scalar peel loop before the vector loop can save this penalty but given its operating over block streams it may be tricky. > We should also extend the scope of optimization (preferably in this PR or in subsequent one) to optimize [MAC computation routine accepting ByteBuffer.](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java#L116), To close this thread.. @jatin-bhateja and I talked and realized that it is not possible to re-align input here. At least not with peeling with scalar loop. Scalar loop peels full blocks only (i.e. 16 bytes at a time). So out of 64 positions, 1 is already aligned, 3 could be aligned with the right peel, and 60 will land badly regardless. 
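For completeness, a tiny counting sketch of that argument (illustration only, not part of the patch): with 16-byte block peeling the start offset modulo 64 can only change in steps of 16, so only start offsets that are multiples of 16 can ever reach 64-byte alignment.
// Illustration only (not patch code): count the start offsets (mod 64) that peeling
// whole 16-byte blocks can bring to 64-byte alignment.
public class PeelAlignmentCount {
    public static void main(String[] args) {
        int reachable = 0;
        for (int offset = 0; offset < 64; offset++) {
            boolean canAlign = false;
            for (int peeled = 0; peeled < 4; peeled++) {   // peel 0..3 whole 16-byte blocks
                canAlign |= (offset + peeled * 16) % 64 == 0;
            }
            if (canAlign) {
                reachable++;
            }
        }
        // Prints 4: offset 0 is already aligned, 16/32/48 can be fixed by peeling,
        // and the remaining 60 offsets stay misaligned no matter how many blocks we peel.
        System.out.println(reachable);
    }
}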
------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 4 14:40:45 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 4 Nov 2022 14:40:45 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: <4AB7TAZwydDonBwfxasMLmgVIQuaLgMUxck7eCbzYxw=.a9062602-90d4-4bde-baff-629bea466527@github.com> References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> <4AB7TAZwydDonBwfxasMLmgVIQuaLgMUxck7eCbzYxw=.a9062602-90d4-4bde-baff-629bea466527@github.com> Message-ID: On Fri, 28 Oct 2022 20:58:33 GMT, Volodymyr Paprotski wrote: >> No, going the WhiteBox route was not something I was thinking of. I sought feedback from a couple hotspot-knowledgable people about the use of WhiteBox APIs and both felt that it was not the right way to go. One said that WhiteBox is really for VM testing and not for these kinds of java classes. > > One idea I was trying to measure was to make the intrinsic (i.e. the while loop remains exactly the same, just moved to different =non-static= function): > > private void processMultipleBlocks(byte[] input, int offset, int length) { //, MutableIntegerModuloP A, IntegerModuloP R) { > while (length >= BLOCK_LENGTH) { > n.setValue(input, offset, BLOCK_LENGTH, (byte)0x01); > a.setSum(n); // A += (temp | 0x01) > a.setProduct(r); // A = (A * R) % p > offset += BLOCK_LENGTH; > length -= BLOCK_LENGTH; > } > } > > > In principle, the java version would not get any slower (i.e. there is only one extra function jump). At the expense of the C++ glue getting more complex. In C++ I need to dig out using IR `(sun.security.util.math.intpoly.IntegerPolynomial.MutableElement)(this.a).limbs` then convert 5*26bit limbs into 3*44-bit limbs. The IR is very new to me so will take some time. (I think I found some AES code that does something similar). > > That said.. I thought this idea would had been perhaps a separate PR, if needed at all.. Digging limbs out is one thing, but also need to add asserts and safety. Mostly would be happy to just measure if its worth it. thread resumed below ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 4 14:40:46 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 4 Nov 2022 14:40:46 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7] In-Reply-To: <523ASDMlZe7mAZaBQe3ipxBLaLum7_XZqLLUUgsCJi0=.db28f521-c957-4fb2-8dcc-7c09d46189e3@github.com> References: <523ASDMlZe7mAZaBQe3ipxBLaLum7_XZqLLUUgsCJi0=.db28f521-c957-4fb2-8dcc-7c09d46189e3@github.com> Message-ID: On Tue, 18 Oct 2022 22:51:51 GMT, Sandhya Viswanathan wrote: >> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: >> >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - address Jamil's review >> - invalidkeyexception and some review comments >> - extra whitespace character >> - assembler checks and test case fixes >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Merge remote-tracking branch 'origin' into avx512-poly >> - further restrict UsePolyIntrinsics with supports_avx512vlbw >> - missed white-space fix >> - - Fix whitespace and copyright statements >> - Add benchmark >> - ... and 2 more: https://git.openjdk.org/jdk/compare/9d3b4ef2...38d9e83c > > src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 286: > >> 284: * numeric values. 
>> 285: */ >> 286: private void setRSVals() { //throws InvalidKeyException { > > The R and S check for invalid key (all bytes zero) could be submitted as a separate PR. > It is not related to the Poly1305 acceleration. done, added a flag ------------- PR: https://git.openjdk.org/jdk/pull/10582 From mdoerr at openjdk.org Fri Nov 4 15:38:33 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 4 Nov 2022 15:38:33 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v4] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Improve comments and make it a bit harder to use incorrectly. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/0ef491fc..eb86c710 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=02-03 Stats: 8 lines in 2 files changed: 5 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From gcao at openjdk.org Fri Nov 4 15:49:35 2022 From: gcao at openjdk.org (Gui Cao) Date: Fri, 4 Nov 2022 15:49:35 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API [v2] In-Reply-To: References: Message-ID: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
> > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu Gui Cao has updated the pull request incrementally with two additional commits since the last revision: - Move REDUCTION_OP definition to macroAssembler_riscv.hpp - Simplify part of the code, extract shared code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10691/files - new: https://git.openjdk.org/jdk/pull/10691/files/dea3ac11..342045ef Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=00-01 Stats: 70 lines in 4 files changed: 34 ins; 12 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/10691.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10691/head:pull/10691 PR: https://git.openjdk.org/jdk/pull/10691 From gcao at openjdk.org Fri Nov 4 15:56:10 2022 From: gcao at openjdk.org (Gui Cao) Date: Fri, 4 Nov 2022 15:56:10 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API [v2] In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 09:03:12 GMT, Eric Liu wrote: >> Gui Cao has updated the pull request incrementally with two additional commits since the last revision: >> >> - Move REDUCTION_OP definition to macroAssembler_riscv.hpp >> - Simplify part of the code, extract shared code > > src/hotspot/cpu/riscv/riscv_v.ad line 788: > >> 786: >> 787: instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ >> 788: predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > > Does `Matcher::vector_element_basic_type(n->in(2)) == T_INT` work here? @theRealELiu Thank you for your suggestion, the current new nodes have been modified, and a new PR will be submitted in the future to modify other related places > src/hotspot/cpu/riscv/riscv_v.ad line 838: > >> 836: __ vredor_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> 837: as_VectorRegister($tmp$$reg)); >> 838: __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > > This is basically a shared code pattern for OrReductionV, AndReductionV, XorReduction. Maybe a common method can help to simplify the code. @theRealELiu Thanks, I extracted a portion of the shared code from it and made a new commit. ------------- PR: https://git.openjdk.org/jdk/pull/10691 From gcao at openjdk.org Fri Nov 4 16:02:41 2022 From: gcao at openjdk.org (Gui Cao) Date: Fri, 4 Nov 2022 16:02:41 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API [v3] In-Reply-To: References: Message-ID: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. 
This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. > > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. 
The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu Gui Cao has updated the pull request incrementally with one additional commit since the last revision: Fix alignment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10691/files - new: https://git.openjdk.org/jdk/pull/10691/files/342045ef..135ed86c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=01-02 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/10691.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10691/head:pull/10691 PR: https://git.openjdk.org/jdk/pull/10691 From gcao at openjdk.org Fri Nov 4 16:07:58 2022 From: gcao at openjdk.org (Gui Cao) Date: Fri, 4 Nov 2022 16:07:58 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API [v4] In-Reply-To: References: Message-ID: <8kMdRs39-diJOlIxoBDYyIKaI8IZOgp10Q-GdHjEVbE=.c7f6a6f3-865a-4f8c-b887-3083b0108739@github.com> > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
> > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu Gui Cao has updated the pull request incrementally with one additional commit since the last revision: Fix whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10691/files - new: https://git.openjdk.org/jdk/pull/10691/files/135ed86c..a8dba862 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10691.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10691/head:pull/10691 PR: https://git.openjdk.org/jdk/pull/10691 From jnimeh at openjdk.org Fri Nov 4 16:32:16 2022 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Fri, 4 Nov 2022 16:32:16 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7] In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 03:20:11 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 12 commits: > > - Merge remote-tracking branch 'origin/master' into avx512-poly > - address Jamil's review > - invalidkeyexception and some review comments > - extra whitespace character > - assembler checks and test case fixes > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Merge remote-tracking branch 'origin' into avx512-poly > - further restrict UsePolyIntrinsics with supports_avx512vlbw > - missed white-space fix > - - Fix whitespace and copyright statements > - Add benchmark > - ... and 2 more: https://git.openjdk.org/jdk/compare/9d3b4ef2...38d9e83c src/hotspot/share/opto/library_call.cpp line 7036: > 7034: assert(r_start, "r array is NULL"); > 7035: > 7036: Node* call = make_runtime_call(RC_LEAF, Can we safely change this to `RC_LEAF | RC_NO_FP`? For the ChaCha20 block intrinsic I'm working on I've been using that parameter because I'm not touching the FP registers and that looks to be the case here (though your intrinsic is a lot more complicated than mine so I may have missed something). I believe the GHASH and AES library call routines also call `make_runtime_call()` in this way. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From ngasson at openjdk.org Fri Nov 4 17:24:01 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Fri, 4 Nov 2022 17:24:01 GMT Subject: RFR: 8294816: C2: Math.min/max vectorization miscompilation [v2] In-Reply-To: References: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> Message-ID: On Wed, 2 Nov 2022 14:42:20 GMT, Bhavana Kilambi wrote: >> C2 miscompiles during auto-vectorization of MinI/MaxI nodes when "short" type operands are involved. When a short and an integer value is compared, C2 generates vector min/max nodes for "short" types which does not result in correct output as it disregards the higher order bits of the integer input. Java API for Math.min/max also only supports the int, long, float and double types but not the subword integer types namely - char, byte and short. Hence, char/short/byte min/max vector instructions should not be generated. >> This patch ensures that MaxV and MinV vector nodes are only generated for the "int" type for MaxI and MinI nodes during auto-vectorization. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Updated string names for Min and Max IR nodes Marked as reviewed by ngasson (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10944 From bkilambi at openjdk.org Fri Nov 4 17:25:48 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 4 Nov 2022 17:25:48 GMT Subject: Integrated: 8294816: C2: Math.min/max vectorization miscompilation In-Reply-To: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> References: <_yz_CZFBqHft7ZJwzc51_-uo_5OWKvb295bc6OGiPx8=.e8479118-1fd2-48cb-a87c-8eccddc979b2@github.com> Message-ID: On Wed, 2 Nov 2022 12:06:14 GMT, Bhavana Kilambi wrote: > C2 miscompiles during auto-vectorization of MinI/MaxI nodes when "short" type operands are involved. When a short and an integer value is compared, C2 generates vector min/max nodes for "short" types which does not result in correct output as it disregards the higher order bits of the integer input. Java API for Math.min/max also only supports the int, long, float and double types but not the subword integer types namely - char, byte and short. 
Hence, char/short/byte min/max vector instructions should not be generated. > This patch ensures that MaxV and MinV vector nodes are only generated for the "int" type for MaxI and MinI nodes during auto-vectorization. This pull request has now been integrated. Changeset: b49bdaea Author: Bhavana Kilambi Committer: Nick Gasson URL: https://git.openjdk.org/jdk/commit/b49bdaeade8445584550dbd5c48ea3c7e9cf1559 Stats: 139 lines in 4 files changed: 139 ins; 0 del; 0 mod 8294816: C2: Math.min/max vectorization miscompilation Reviewed-by: thartmann, ngasson ------------- PR: https://git.openjdk.org/jdk/pull/10944 From ascarpino at openjdk.org Fri Nov 4 17:29:47 2022 From: ascarpino at openjdk.org (Anthony Scarpino) Date: Fri, 4 Nov 2022 17:29:47 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7] In-Reply-To: References: Message-ID: <_KivimSjPXP-a8M1gaVOxawfozZ8K4mOkvWwb1w00J8=.70c924a0-8d7e-4673-977c-01d1044ae73c@github.com> On Fri, 4 Nov 2022 03:20:11 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Merge remote-tracking branch 'origin/master' into avx512-poly > - address Jamil's review > - invalidkeyexception and some review comments > - extra whitespace character > - assembler checks and test case fixes > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Merge remote-tracking branch 'origin' into avx512-poly > - further restrict UsePolyIntrinsics with supports_avx512vlbw > - missed white-space fix > - - Fix whitespace and copyright statements > - Add benchmark > - ... and 2 more: https://git.openjdk.org/jdk/compare/9d3b4ef2...38d9e83c Thanks for moving the conversion of R and A from the java code into the intrinsic. That certainly reduced the footprint on the java code with regard to performance and code flow. 
------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 4 17:29:48 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 4 Nov 2022 17:29:48 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7] In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 16:28:51 GMT, Jamil Nimeh wrote: >> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: >> >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - address Jamil's review >> - invalidkeyexception and some review comments >> - extra whitespace character >> - assembler checks and test case fixes >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Merge remote-tracking branch 'origin' into avx512-poly >> - further restrict UsePolyIntrinsics with supports_avx512vlbw >> - missed white-space fix >> - - Fix whitespace and copyright statements >> - Add benchmark >> - ... and 2 more: https://git.openjdk.org/jdk/compare/9d3b4ef2...38d9e83c > > src/hotspot/share/opto/library_call.cpp line 7036: > >> 7034: assert(r_start, "r array is NULL"); >> 7035: >> 7036: Node* call = make_runtime_call(RC_LEAF, > > Can we safely change this to `RC_LEAF | RC_NO_FP`? For the ChaCha20 block intrinsic I'm working on I've been using that parameter because I'm not touching the FP registers and that looks to be the case here (though your intrinsic is a lot more complicated than mine so I may have missed something). I believe the GHASH and AES library call routines also call `make_runtime_call()` in this way. Makes sense to me, will put it in and re-test (no fp registers anywhere in the intrinsic). Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10582 From redestad at openjdk.org Fri Nov 4 20:40:23 2022 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 4 Nov 2022 20:40:23 GMT Subject: RFR: 8296426: x86: Narrow UseAVX and UseSSE flags Message-ID: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> This patch narrows down the UseAVX and UseSSE flags to their actual supported range and uses int rather than intx for their type. This avoids need for some silly casts, and surprisingly has a small beneficial effect to binary size (-4kb libjvm on linux-x64) This changes behavior of previously in-range values: `-XX:UseAVX=4` would emit a strongly worded warning, but with the proposed change we'll instead terminate the JVM with an error similar to `-XX:UseAVX=100`. I believe this is too trivial for a CSR, since it only changes behavior for unsupported values. 
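For illustration, the rough shape such a narrowed declaration takes in the globals_x86.hpp flag table (a sketch only; the default values, description strings and previous ranges shown here are assumptions, and the macro table's line continuations are omitted):

    product(int, UseAVX, 3, "Highest supported AVX instructions set on x86/x64")
            range(0, 3)    // previously declared as intx with a much wider range

    product(int, UseSSE, 4, "Highest supported SSE instructions set on x86/x64")
            range(0, 4)    // previously declared as intx with a much wider range

With the narrowed range(), the regular flag range checking rejects any unsupported value at startup and terminates the VM with an error, which matches the behavior change described above.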
------------- Commit messages: - Merge - Narrow UseAVX and UseSSE flags Changes: https://git.openjdk.org/jdk/pull/10997/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10997&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296426 Stats: 11 lines in 2 files changed: 0 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/10997.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10997/head:pull/10997 PR: https://git.openjdk.org/jdk/pull/10997 From vlivanov at openjdk.org Fri Nov 4 20:40:24 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 4 Nov 2022 20:40:24 GMT Subject: RFR: 8296426: x86: Narrow UseAVX and UseSSE flags In-Reply-To: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> References: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> Message-ID: On Fri, 4 Nov 2022 20:28:57 GMT, Claes Redestad wrote: > This patch narrows down the UseAVX and UseSSE flags to their actual supported range and uses int rather than intx for their type. This avoids need for some silly casts, and surprisingly has a small beneficial effect to binary size (-4kb libjvm on linux-x64) > > This changes behavior of previously in-range values: `-XX:UseAVX=4` would emit a strongly worded warning, but with the proposed change we'll instead terminate the JVM with an error similar to `-XX:UseAVX=100`. I believe this is too trivial for a CSR, since it only changes behavior for unsupported values. Nice! Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/10997 From duke at openjdk.org Fri Nov 4 21:01:40 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 4 Nov 2022 21:01:40 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Tue, 25 Oct 2022 00:31:07 GMT, Sandhya Viswanathan wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> extra whitespace character > > src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175: > >> 173: // Choice of 1024 is arbitrary, need enough data blocks to amortize conversion overhead >> 174: // and not affect platforms without intrinsic support >> 175: int blockMultipleLength = (len/BLOCK_LENGTH) * BLOCK_LENGTH; > > The ByteBuffer version can also benefit from this optimization if it has array as backing storage. I spent some time looking at `engineUpdate(ByteBuffer buf)`. I think it makes sense to make it into a separate PR. I think I figured out the code, but its rather 'finicky'. 
The existing function is already rather clever; there are quite a few cases to get correct (`engineUpdate(byte[] input, int offset, int len)` unrolled the decision tree, so its easier to reason about) For future reference, patched but untested: void engineUpdate(ByteBuffer buf) { int remaining = buf.remaining(); while (remaining > 0) { int bytesToWrite = Integer.min(remaining, BLOCK_LENGTH - blockOffset); if (bytesToWrite >= BLOCK_LENGTH) { // Have at least one full block in the buf, process all full blocks int blockMultipleLength = buf.remaining() & (~(BLOCK_LENGTH-1)); processMultipleBlocks(buf, blockMultipleLength); remaining -= blockMultipleLength; } else { // We have some left-over data from previous updates, so // copy that into the holding block until we get a full block. buf.get(block, blockOffset, bytesToWrite); blockOffset += bytesToWrite; if (blockOffset >= BLOCK_LENGTH) { processBlock(block, 0, BLOCK_LENGTH); blockOffset = 0; } remaining -= bytesToWrite; } } } private void processMultipleBlocks(ByteBuffer buf, int blockMultipleLength) { if (buf.hasArray()) { byte[] input = buf.array(); int offset = buf.arrayOffset(); Objects.checkFromIndexSize(offset, blockMultipleLength, input.length); a.checkLimbsForIntrinsic(); r.checkLimbsForIntrinsic(); processMultipleBlocks(input, offset, blockMultipleLength); return; } while (blockMultipleLength > 0) { processBlock(buf, BLOCK_LENGTH); blockMultipleLength -= BLOCK_LENGTH; } } But it might make more sense to emulate `engineUpdate(byte[] input, int offset, int len)` and unroll the loop. (Hint: to test for Buffer without array, create read-only buffer: public final boolean hasArray() { return (hb != null) && !isReadOnly; } end hint) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From sviswanathan at openjdk.org Fri Nov 4 21:05:30 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 4 Nov 2022 21:05:30 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5] In-Reply-To: References: <9h52z_DWFvTWWwasN7vzl9-7C0-Tj50Cis4fgRNuId8=.65de1f73-f5f3-4326-b9e0-6211861452ea@github.com> Message-ID: On Fri, 4 Nov 2022 20:59:10 GMT, Volodymyr Paprotski wrote: >> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175: >> >>> 173: // Choice of 1024 is arbitrary, need enough data blocks to amortize conversion overhead >>> 174: // and not affect platforms without intrinsic support >>> 175: int blockMultipleLength = (len/BLOCK_LENGTH) * BLOCK_LENGTH; >> >> The ByteBuffer version can also benefit from this optimization if it has array as backing storage. > > I spent some time looking at `engineUpdate(ByteBuffer buf)`. I think it makes sense to make it into a separate PR. I think I figured out the code, but its rather 'finicky'. 
The existing function is already rather clever; there are quite a few cases to get correct (`engineUpdate(byte[] input, int offset, int len)` unrolled the decision tree, so its easier to reason about) > > For future reference, patched but untested: > > > void engineUpdate(ByteBuffer buf) { > int remaining = buf.remaining(); > while (remaining > 0) { > int bytesToWrite = Integer.min(remaining, > BLOCK_LENGTH - blockOffset); > > if (bytesToWrite >= BLOCK_LENGTH) { > // Have at least one full block in the buf, process all full blocks > int blockMultipleLength = buf.remaining() & (~(BLOCK_LENGTH-1)); > processMultipleBlocks(buf, blockMultipleLength); > remaining -= blockMultipleLength; > } else { > // We have some left-over data from previous updates, so > // copy that into the holding block until we get a full block. > buf.get(block, blockOffset, bytesToWrite); > blockOffset += bytesToWrite; > > if (blockOffset >= BLOCK_LENGTH) { > processBlock(block, 0, BLOCK_LENGTH); > blockOffset = 0; > } > remaining -= bytesToWrite; > } > } > } > > private void processMultipleBlocks(ByteBuffer buf, int blockMultipleLength) { > if (buf.hasArray()) { > byte[] input = buf.array(); > int offset = buf.arrayOffset(); > > Objects.checkFromIndexSize(offset, blockMultipleLength, input.length); > a.checkLimbsForIntrinsic(); > r.checkLimbsForIntrinsic(); > processMultipleBlocks(input, offset, blockMultipleLength); > return; > } > > while (blockMultipleLength > 0) { > processBlock(buf, BLOCK_LENGTH); > blockMultipleLength -= BLOCK_LENGTH; > } > } > > > But it might make more sense to emulate `engineUpdate(byte[] input, int offset, int len)` and unroll the loop. (Hint: to test for Buffer without array, create read-only buffer: > > public final boolean hasArray() { > return (hb != null) && !isReadOnly; > } > > end hint) Sounds good, let us do the ByteBuffer support as a follow on PR. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From dlong at openjdk.org Fri Nov 4 21:42:29 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 4 Nov 2022 21:42:29 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v3] In-Reply-To: References: <-b9cMAzyXX2w1SZPG6k9SlEu1DXN5cPuNSLn_DZJ2Zs=.4742dfd3-e212-4e00-a328-34b182a0c690@github.com> Message-ID: On Fri, 4 Nov 2022 13:08:38 GMT, Martin Doerr wrote: > This is a further indication that the code is not needed. Should I file another RFE for that? Please do, and link JDK-8294002 as related. I suspected that the duplicated metadata on aarch64 could be a problem, if we ended up with different values, however I wasn't able to reproduce a problem. For this PR, would it make sense to use a MetadataClosure and metadata_do to check for metadata? ------------- PR: https://git.openjdk.org/jdk/pull/10933 From gcao at openjdk.org Sat Nov 5 04:35:38 2022 From: gcao at openjdk.org (Gui Cao) Date: Sat, 5 Nov 2022 04:35:38 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API [v5] In-Reply-To: References: Message-ID: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
> > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu Gui Cao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Merge branch 'master' into vector-api-reduction # Conflicts: # src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp - Fix whitespace - Fix alignment - Move REDUCTION_OP definition to macroAssembler_riscv.hpp - Simplify part of the code, extract shared code - Add Reduction C2 instructions for Vector api ------------- Changes: https://git.openjdk.org/jdk/pull/10691/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=04 Stats: 139 lines in 4 files changed: 136 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10691.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10691/head:pull/10691 PR: https://git.openjdk.org/jdk/pull/10691 From gcao at openjdk.org Sat Nov 5 14:52:22 2022 From: gcao at openjdk.org (Gui Cao) Date: Sat, 5 Nov 2022 14:52:22 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API [v6] In-Reply-To: References: Message-ID: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
> > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu Gui Cao has updated the pull request incrementally with one additional commit since the last revision: Format code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10691/files - new: https://git.openjdk.org/jdk/pull/10691/files/3e31a773..a7db305d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=04-05 Stats: 15 lines in 3 files changed: 1 ins; 0 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/10691.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10691/head:pull/10691 PR: https://git.openjdk.org/jdk/pull/10691 From mdoerr at openjdk.org Sat Nov 5 18:29:28 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Sat, 5 Nov 2022 18:29:28 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v5] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Change checking code to reuse more existing code (oops_do and metadata_do). ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/eb86c710..44215866 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=03-04 Stats: 35 lines in 2 files changed: 18 ins; 15 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Sat Nov 5 18:29:28 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Sat, 5 Nov 2022 18:29:28 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v3] In-Reply-To: References: <-b9cMAzyXX2w1SZPG6k9SlEu1DXN5cPuNSLn_DZJ2Zs=.4742dfd3-e212-4e00-a328-34b182a0c690@github.com> Message-ID: On Fri, 4 Nov 2022 21:38:45 GMT, Dean Long wrote: > > This is a further indication that the code is not needed. Should I file another RFE for that? > > Please do, and link JDK-8294002 as related. 
I suspected that the duplicated metadata on aarch64 could be a problem, if we ended up with different values, however I wasn't able to reproduce a problem. > > For this PR, would it make sense to use a MetadataClosure and metadata_do to check for metadata? New Issue: https://bugs.openjdk.org/browse/JDK-8296440 Regarding the aarch64 question: I don't think we ever hit the case in which the Method* gets cleared. So, both values never differ. It is possible to reuse the existing code (oops_do and metadata_do). Please see my latest commit. Requires extra walks in debug build, but I think it's affordable. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Sat Nov 5 19:27:45 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Sat, 5 Nov 2022 19:27:45 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v6] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Need to ignore own Method when using metadata_do. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/44215866..860b71ed Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=04-05 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From rrich at openjdk.org Sun Nov 6 17:28:53 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Sun, 6 Nov 2022 17:28:53 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: References: Message-ID: > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. > > The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. > > > X86 / AARCH64 PPC64: > > : : : : > : : : : > | | | | > |-----------------| |-----------------| > | | | | > | stack arguments | | stack arguments | > | |<- callers_SP | | > =================== |-----------------| > | | | | > | metadata at bottom | | metadata at top | > | | | |<- callers_SP > |-----------------| =================== > | | | | > | | | | > | | | | > | | | | > | |<- SP | | > =================== |-----------------| > | | > | metadata at top | > | |<- SP > =================== > > > On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). 
On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. > > * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: > `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` > > * address of stack arguments: > `callers_SP + frame::metadata_words_at_top` > > * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. > > Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. > > The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. > > Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. > > Thanks, Richard. Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: - Fix cpp condition and add PPC64 - Changes lost in merge - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 - Use callers_sp for fsize calculation in recurse_freeze_interpreted_frame - Loom ppc64le port ------------- Changes: https://git.openjdk.org/jdk/pull/10961/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=02 Stats: 3394 lines in 66 files changed: 2986 ins; 109 del; 299 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From omikhaltcova at openjdk.org Sun Nov 6 21:02:17 2022 From: omikhaltcova at openjdk.org (Olga Mikhaltsova) Date: Sun, 6 Nov 2022 21:02:17 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> [v3] In-Reply-To: References: Message-ID: <6AcveZEfV2AvLEpEP-nSTG3r9aqc_S82tbDatEw1h4s=.f8b29e01-67af-4ff6-9f60-b84264bc724d@github.com> > This PR is opened as a follow-up for [1] and included the "must-done" fixes pointed by @teshull. > > This patch for JVMCI includes the following fixes related to the macOS AArch64 calling convention: > 1. arguments may consume slots on the stack that are not multiples of 8 bytes [2] > 2. natural alignment of stack arguments [2] > 3. 
stack must remain 16-byte aligned [3][4] > > Tested with tier1 on macOS AArch64 and Linux AArch64. > > [1] https://github.com/openjdk/jdk/pull/6641 > [2] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms > [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack > [4] https://docs.microsoft.com/en-us/cpp/build/stack-usage?view=msvc-170 Olga Mikhaltsova has updated the pull request incrementally with one additional commit since the last revision: Refactoring ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10238/files - new: https://git.openjdk.org/jdk/pull/10238/files/6f8b3215..8b9922a2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10238&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10238&range=01-02 Stats: 17 lines in 4 files changed: 4 ins; 7 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10238.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10238/head:pull/10238 PR: https://git.openjdk.org/jdk/pull/10238 From eliu at openjdk.org Mon Nov 7 01:33:32 2022 From: eliu at openjdk.org (Eric Liu) Date: Mon, 7 Nov 2022 01:33:32 GMT Subject: RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API [v6] In-Reply-To: References: Message-ID: <0-4yVUpHVbWO1_2ykx3WztcptppbrinSjMtUP4NWX0A=.d10d5549-de66-4c0b-a582-1b266219f206@github.com> On Sat, 5 Nov 2022 14:52:22 GMT, Gui Cao wrote: >> Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. >> >> For example, AndReductionV is implemented as follows: >> >> >> diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad >> index 0ef36fdb292..c04962993c0 100644 >> --- a/src/hotspot/cpu/riscv/riscv_v.ad >> +++ b/src/hotspot/cpu/riscv/riscv_v.ad >> @@ -63,7 +63,6 @@ source %{ >> case Op_ExtractS: >> case Op_ExtractUB: >> // Vector API specific >> - case Op_AndReductionV: >> case Op_OrReductionV: >> case Op_XorReductionV: >> case Op_LoadVectorGather: >> @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ >> ins_pipe(pipe_slow); >> %} >> >> +// vector and reduction >> + >> +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ >> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); >> + match(Set dst (AndReductionV src1 src2)); >> + effect(TEMP tmp); >> + ins_cost(VEC_COST); >> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" >> + "vredand.vs $tmp, $src2, $tmp\n\t" >> + "vmv.x.s $dst, $tmp" %} >> + ins_encode %{ >> + __ vsetvli(t0, x0, Assembler::e32); >> + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); >> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> + as_VectorRegister($tmp$$reg)); >> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); >> + %} >> + ins_pipe(pipe_slow); >> +%} >> + >> +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ >> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); >> + match(Set dst (AndReductionV src1 src2)); >> + effect(TEMP tmp); >> + ins_cost(VEC_COST); >> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" >> + "vredand.vs $tmp, $src2, $tmp\n\t" >> + "vmv.x.s $dst, $tmp" %} >> + ins_encode %{ >> + __ vsetvli(t0, x0, Assembler::e64); >> + __ vmv_s_x(as_VectorRegister($tmp$$reg), 
$src1$$Register); >> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), >> + as_VectorRegister($tmp$$reg)); >> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); >> + %} >> >> >> >> After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. >> >> By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: >> >> >> 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 >> 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass >> 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null >> 2b8 ld R30, [R14, #40] # class, #@loadKlass >> 2bc li R7, #-1 # int, #@loadConI >> 2c0 vmv.s.x V1, R7 #@reduce_andI >> vredand.vs V1, V2, V1 >> vmv.x.s R28, V1 >> 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP >> 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 >> >> >> There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests >> [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests >> [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md >> >> ## Testing: >> - hotspot and jdk tier1 on unmatched board without new failures >> - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu >> - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu > > Gui Cao has updated the pull request incrementally with one additional commit since the last revision: > > Format code Marked as reviewed by eliu (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/10691 From cslucas at openjdk.org Mon Nov 7 02:05:18 2022 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 7 Nov 2022 02:05:18 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v13] In-Reply-To: References: Message-ID: On Thu, 27 Oct 2022 06:58:38 GMT, Vladimir Ivanov wrote: >>> >>> As of now, it serves dual purpose. It (1) marks a merge point as safe to be untangled during SR; and (2) caches information about field values. >> >> An other important purpose of RAM is to have information at SafePoints after merge point for reallocation during deoptimization. You need Klass information. 
I don't think having only Phis for values is enough. >> >>> >>> I believe you can solve it in a cleaner manner without introducing placeholder nodes and connection graph adjustments. IMO it's all about keeping escape status and properly handling "safe" merges in `split_unique_types`. >>> >>> One possible way to handle merge points is: >>> >>> * Handle merge points in `adjust_scalar_replaceable_state` and refrain from marking relevant bases as NSR when possible. >>> * After `adjust_scalar_replaceable_state` is over, every merge point should have all its inputs as either NSR or SR. >>> * `split_unique_types` incrementally builds value phis to eventually replace the base phi at merge point while processing SR allocations one by one. >>> * After `split_unique_types` is done, there are no merge points anymore, each allocation has a dedicated memory graph and allocation elimination can proceed as before. >> >> I am not sure how this could be possible. Currently EA rely on IGVN to propagate fields values based on unique memory slice. What you do with memory Load or Store nodes after merge point? Which memory slice you will use for them? >> >>> >>> Do you see any problems with such an approach? >>> >>> One thing still confuses me though: the patch mentions that RAMs can merge both eliminated and not-yet-eliminated allocations. What's the intended use case? I believe it's still required to have all merged allocations to be eventually eliminated. Do you try to handle the case during allocation elimination when part of the inputs are already eliminated and the rest is pending their turn? >> >> There is check for it in `ConnectionGraph::can_reduce_this_phi()`. The only supported cases is when no deoptimization point (SFP or UNCT) after merge point. It allow eliminate SR allocations even if they merge with NSR allocations. This was idea. > >> An other important purpose of RAM is to have information at SafePoints after merge point for reallocation during deoptimization. You need Klass information. I don't think having only Phis for values is enough. > > Klass information is available either from Allocation node in `split_unique_types` or ConnectionGraph instance the Phi is part of. > > >> I am not sure how this could be possible. Currently EA rely on IGVN to propagate fields values based on unique memory slice. What you do with memory Load or Store nodes after merge point? Which memory slice you will use for them? > > My understanding of how proposed approach is expected to work: merge points have to be simple enough to still allow splitting unique types for individual allocations. > > For example, `eliminate_ram_addp_use()` replaces `Load (AddP (Phi base1 ... basen) off) mem` with `Phi (val1 ... valn)` and `eliminate_reduced_allocation_merge()` performs similar transformation for `SafePoint`s. > > Alternatively, corresponding `Phi`s can be build incrementally while processing each individual `base` by `split_unique_types`. Or, just by splitting `Load`s through `Phi`: > > Load (AddP (Phi base_1 ... base_n) off) mem > == split-through-phi ==> > Phi ((Load (AddP base_1 off) mem) ... (Load (AddP base_n off) mem)) > == split_unique_types ==> > Phi ((Load (AddP base_1 off) mem_1) ... (Load (AddP base_n off) mem_n)) > == IGVN ==> > Phi (val_1 ... val_n) > ``` > >> There is check for it in ConnectionGraph::can_reduce_this_phi(). The only supported cases is when no deoptimization point (SFP or UNCT) after merge point. It allow eliminate SR allocations even if they merge with NSR allocations. This was idea. 
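(For concreteness, the merge shape under discussion corresponds to source code like the following minimal, made-up Java example; `Point`, `MergeShape` and the constants are invented and only illustrate the pattern, not any particular benchmark.)

```
// Hypothetical example: both branches allocate, the results meet in one local
// (a Phi over two AllocateNodes in the IR), and the field load after the merge
// is the "Load (AddP (Phi base1 base2) off) mem" shape discussed above.
class Point {
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

class MergeShape {
    static int sample(boolean flag) {
        Point p;
        if (flag) {
            p = new Point(1, 2);   // base1
        } else {
            p = new Point(3, 4);   // base2
        }
        // Splitting this load through the Phi yields Phi(1, 3) directly and
        // leaves each allocation without a merged user, so both can be
        // scalar replaced.
        return p.x;
    }
}
```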
> > That's nice! Now I see `has_call_as_user`-related code. > > It means that only `Load (AddP (Phi base_1 ... base_n) off) mem` shapes are allowed now. > I believe the aforementioned split-through-phi transformation should handle it well: > > Load (AddP (Phi base_1 ... base_n) off) mem > == split-through-phi ==> > Phi ((Load (AddP base_1 off) mem) ... (Load (AddP base_n off) mem)) > == split_unique_types ==> > Phi (... (Load (AddP base_SR_i off) mem_i) ... (Load (AddP base_NSR_n off) mem) ...) > == IGVN ==> > Phi (... val_i ... (Load (AddP base_NSR_n off) mem) ... ) Hi @iwanowww - Thank you for clarifying things! After much thought and some testing, I think I can make the RAM node go away and achieve the results I want. Below are some additional comments & the overall approach that I'm going to switch to. - `LoadNode::split_through_phi` requires `is_known_instance_field` and therefore can't be run before `split_unique_types` without changes. - I think splitting the merge Phi as part of `adjust_scalar_replaceable_state` might be the best place to create the value Phi for the fields. However, if the merge Phi is used by a SafePoint/UncommonTrap then it's necessary to also create a `SafePointScalarObjectNode` (SSON). As you know, there is already logic to create SSON as part of `PhaseMacroExpand::scalar_replacement`. So, we have to decide if it's best to re-use the code in `PhaseMacroExpand::scalar_replacement` in `adjust_scalar_replaceable_state` or if we want to add new code to create SSON for merge Phis in `PhaseMacroExpand::scalar_replacement`. Here are the Pros & Cons of the two options I mentioned above: - Create SSON in `adjust_scalar_replaceable_state`: Pros: all logic to split phi is centered in one place. Cons: the logic to create SSON is used outside `scalar_replacement` routine. - Create SafePointScalarReplacedNode in `scalar_replacement` routine. Pros: Scalar replacement of merge Phi is contained in scalar_replacement routine. Cons: the logic to split merge phi is spread throughout escape analysis and scalar replacement. Does it look like a better approach? TIA! ------------- PR: https://git.openjdk.org/jdk/pull/9073 From gcao at openjdk.org Mon Nov 7 05:04:09 2022 From: gcao at openjdk.org (Gui Cao) Date: Mon, 7 Nov 2022 05:04:09 GMT Subject: Integrated: 8295261: RISC-V: Support ReductionV instructions for Vector API In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:54:47 GMT, Gui Cao wrote: > Currently, certain vector-specific instructions in c2 are not implemented in RISC-V. This patch will add support of `AndReductionV`, `OrReductionV`, `XorReductionV` for RISC-V. This patch was implemented by referring to the sve version of aarch64 and riscv-v-spec v1.0 [1]. 
> > For example, AndReductionV is implemented as follows: > > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 0ef36fdb292..c04962993c0 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -63,7 +63,6 @@ source %{ > case Op_ExtractS: > case Op_ExtractUB: > // Vector API specific > - case Op_AndReductionV: > case Op_OrReductionV: > case Op_XorReductionV: > case Op_LoadVectorGather: > @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{ > ins_pipe(pipe_slow); > %} > > +// vector and reduction > + > +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e32); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > + ins_pipe(pipe_slow); > +%} > + > +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{ > + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG); > + match(Set dst (AndReductionV src1 src2)); > + effect(TEMP tmp); > + ins_cost(VEC_COST); > + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t" > + "vredand.vs $tmp, $src2, $tmp\n\t" > + "vmv.x.s $dst, $tmp" %} > + ins_encode %{ > + __ vsetvli(t0, x0, Assembler::e64); > + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register); > + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg), > + as_VectorRegister($tmp$$reg)); > + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg)); > + %} > > > > After this patch, Vector API can use RVV with the `-XX:+UseRVV` parameter when executing java programs on the RISC-V RVV 1.0 platform. Tests [2] and [3] can be used to test the implementation of this node and it passes the tests properly. > > By adding the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing the test case, hsdis is currently unable to decompile rvv's assembly instructions. The relevant OptoAssembly log output in the compilation log is as follows: > > > 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131 > 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass > 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null > 2b8 ld R30, [R14, #40] # class, #@loadKlass > 2bc li R7, #-1 # int, #@loadConI > 2c0 vmv.s.x V1, R7 #@reduce_andI > vredand.vs V1, V2, V1 > vmv.x.s R28, V1 > 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP > 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000 > > > There is no hardware implementation of RISC-V RVV 1.0, so the tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. The execution of `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu, with `-XX:+UseRVV` turned on, can reduce the execution time of this method by about 50.7% compared to the RVV version without this node implemented. 
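(For reference, the Java-level pattern those tests exercise is a lane-wise AND reduction with the Vector API. A minimal sketch is below; the class and method names are made up, and it needs `--add-modules jdk.incubator.vector` to compile.)

```
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class AndReduceSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

    // reduceLanes(VectorOperators.AND) is the call that ends up as an
    // AndReductionV node when it is intrinsified by C2.
    static int andReduceAll(int[] a) {
        int res = -1;
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            res &= IntVector.fromArray(SPECIES, a, i).reduceLanes(VectorOperators.AND);
        }
        for (; i < a.length; i++) {   // scalar tail
            res &= a[i];
        }
        return res;
    }
}
```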
After implementing this node, by comparing the influence of the number of C2 assembly instructions before and after the -XX:+UseRVV parameter is enabled, after enabling -XX:+UseRVV, the number of assembly instructions is reduced by about 50% [4] > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests > [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests > [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > ## Testing: > - hotspot and jdk tier1 on unmatched board without new failures > - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu > - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu This pull request has now been integrated. Changeset: 087cedc0 Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/087cedc080963f027306f9d4c4ab737ddf42a5bc Stats: 140 lines in 4 files changed: 137 ins; 3 del; 0 mod 8295261: RISC-V: Support ReductionV instructions for Vector API Reviewed-by: yadongwang, dzhang, fyang, eliu ------------- PR: https://git.openjdk.org/jdk/pull/10691 From chagedorn at openjdk.org Mon Nov 7 11:30:01 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 7 Nov 2022 11:30:01 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong Message-ID: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): assert(real_LCA != NULL, "must always find an LCA" ``` The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: - I limited the node dump to idx + node name to reduce the noise which made it hard to read. - Reversed the idom chain dumps to reflect the graph structure. Example output: Bad graph detected in build_loop_late n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) [... same output as before ...] idoms of early "197 IfFalse": idom[2]: 42 If idom[1]: 44 IfTrue idom[0]: 196 If n: 197 IfFalse idoms of (wrong) LCA "205 IfTrue": idom[4]: 42 If idom[3]: 37 Region idom[2]: 73 If idom[1]: 83 IfTrue idom[0]: 204 If n: 205 IfTrue Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. 
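For intuition, finding the real LCA boils down to collecting the two idom chains and walking them from the root until they diverge. A toy sketch in plain Java; the `Node` class and `idom` field below are stand-ins for illustration, not the real C2 classes:

```
import java.util.ArrayList;
import java.util.List;

final class Node {
    final int idx;
    final String name;
    Node idom;   // immediate dominator; null for the root

    Node(int idx, String name) { this.idx = idx; this.name = name; }
}

final class RealLcaSketch {
    // Collect the idom chain from a node up to the root (node first, root last).
    static List<Node> idomChain(Node n) {
        List<Node> chain = new ArrayList<>();
        for (Node c = n; c != null; c = c.idom) {
            chain.add(c);
        }
        return chain;
    }

    // Walk both chains from the root downwards; the last node they share is
    // the real LCA of 'early' and the (wrong) LCA.
    static Node realLca(Node early, Node wrongLca) {
        List<Node> a = idomChain(early);
        List<Node> b = idomChain(wrongLca);
        Node lca = null;
        int i = a.size() - 1;
        int j = b.size() - 1;
        while (i >= 0 && j >= 0 && a.get(i) == b.get(j)) {
            lca = a.get(i);
            i--;
            j--;
        }
        return lca;   // non-null as long as both chains end at the same root
    }
}
```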
Thanks, Christian ------------- Commit messages: - 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong Changes: https://git.openjdk.org/jdk/pull/11015/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11015&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8286800 Stats: 131 lines in 2 files changed: 59 ins; 30 del; 42 mod Patch: https://git.openjdk.org/jdk/pull/11015.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11015/head:pull/11015 PR: https://git.openjdk.org/jdk/pull/11015 From chagedorn at openjdk.org Mon Nov 7 11:47:51 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 7 Nov 2022 11:47:51 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong [v2] In-Reply-To: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: > We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): > > assert(real_LCA != NULL, "must always find an LCA" > ``` > The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: > - I limited the node dump to idx + node name to reduce the noise which made it hard to read. > - Reversed the idom chain dumps to reflect the graph structure. > > Example output: > > Bad graph detected in build_loop_late > n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) > > [... same output as before ...] > > idoms of early "197 IfFalse": > idom[2]: 42 If > idom[1]: 44 IfTrue > idom[0]: 196 If > n: 197 IfFalse > > idoms of (wrong) LCA "205 IfTrue": > idom[4]: 42 If > idom[3]: 37 Region > idom[2]: 73 If > idom[1]: 83 IfTrue > idom[0]: 204 If > n: 205 IfTrue > > Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): > 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) > > Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Fix optimized build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11015/files - new: https://git.openjdk.org/jdk/pull/11015/files/884c0f82..342213d6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11015&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11015&range=00-01 Stats: 6 lines in 1 file changed: 1 ins; 3 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11015.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11015/head:pull/11015 PR: https://git.openjdk.org/jdk/pull/11015 From omikhaltcova at openjdk.org Mon Nov 7 13:13:27 2022 From: omikhaltcova at openjdk.org (Olga Mikhaltsova) Date: Mon, 7 Nov 2022 13:13:27 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> [v3] In-Reply-To: References: Message-ID: On Wed, 14 Sep 2022 08:56:24 GMT, Andrew Haley wrote: >> I tried to be closer to the original review https://github.com/openjdk/jdk/pull/6641 that requires only 2 fixes and tried to do only this in order to continue easily. >> >> Could you clarify please what boolean you talk about? 
`private final boolean macOS;` that was pushed into `class AArch64HotSpotRegisterConfig`, right? I'm hesitating a bit because of the highlighted code. > > Yes, that `macOS` boolean. > > Maybe it's not worth the effort, but it seems to me as though the use of the boolean in several places is something of a code smell, and this patch makes it more so. The control flow is not easy to follow. > I am wondering if refactoring it so that the code between L269 and L291 were broken out into two methods, one for MacOS and one for the others. I might be wrong, but I'd try it. @theRealAph could you please take a look! I've made some additional refactoring in order to get rid of those "boolean"s where it's possible. I moved them from AArch64HotSpotVMConfig to TargetDescription. This makes an access to them easier across the code. As a 2nd step it's possible to substitute these "boolean"s with "enum" if it's needed. But imho it's better to keep them as "boolean"s at the moment. In addition I see a new class RISCV64HotSpotVMConfig (committed after this pr) that also declared the same boolean: `final boolean linuxOs = Services.getSavedProperty("os.name", "").startsWith("Linux");` If this refactoring is made, this "boolean" won't be needed as well because it can be accessed from TargetDescription. ------------- PR: https://git.openjdk.org/jdk/pull/10238 From omikhaltcova at openjdk.org Mon Nov 7 13:33:33 2022 From: omikhaltcova at openjdk.org (Olga Mikhaltsova) Date: Mon, 7 Nov 2022 13:33:33 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> [v3] In-Reply-To: <6AcveZEfV2AvLEpEP-nSTG3r9aqc_S82tbDatEw1h4s=.f8b29e01-67af-4ff6-9f60-b84264bc724d@github.com> References: <6AcveZEfV2AvLEpEP-nSTG3r9aqc_S82tbDatEw1h4s=.f8b29e01-67af-4ff6-9f60-b84264bc724d@github.com> Message-ID: <8M1GZUYKpvbr4DCJ3139r8D0-njoWu07yWLnP0jLxtU=.8cd5de15-a4c7-4a6a-9705-9979cbc82f60@github.com> On Sun, 6 Nov 2022 21:02:17 GMT, Olga Mikhaltsova wrote: >> This PR is opened as a follow-up for [1] and included the "must-done" fixes pointed by @teshull. >> >> This patch for JVMCI includes the following fixes related to the macOS AArch64 calling convention: >> 1. arguments may consume slots on the stack that are not multiples of 8 bytes [2] >> 2. natural alignment of stack arguments [2] >> 3. stack must remain 16-byte aligned [3][4] >> >> Tested with tier1 on macOS AArch64 and Linux AArch64. >> >> [1] https://github.com/openjdk/jdk/pull/6641 >> [2] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms >> [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack >> [4] https://docs.microsoft.com/en-us/cpp/build/stack-usage?view=msvc-170 > > Olga Mikhaltsova has updated the pull request incrementally with one additional commit since the last revision: > > Refactoring @teshull could you please take a look! I tried to make fixes according to your comments in #6641 ------------- PR: https://git.openjdk.org/jdk/pull/10238 From aph at openjdk.org Mon Nov 7 17:01:37 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 7 Nov 2022 17:01:37 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> [v3] In-Reply-To: References: Message-ID: On Mon, 7 Nov 2022 13:10:57 GMT, Olga Mikhaltsova wrote: >> Yes, that `macOS` boolean. 
>> >> Maybe it's not worth the effort, but it seems to me as though the use of the boolean in several places is something of a code smell, and this patch makes it more so. The control flow is not easy to follow. >> I am wondering if refactoring it so that the code between L269 and L291 were broken out into two methods, one for MacOS and one for the others. I might be wrong, but I'd try it. > > @theRealAph could you please take a look! I've made some additional refactoring in order to get rid of those "boolean"s where it's possible. I moved them from AArch64HotSpotVMConfig to TargetDescription. This makes an access to them easier across the code. As a 2nd step it's possible to substitute these "boolean"s with "enum" if it's needed. But imho it's better to keep them as "boolean"s at the moment. > In addition I see a new class RISCV64HotSpotVMConfig (committed after this pr) that also declared the same boolean: > `final boolean linuxOs = Services.getSavedProperty("os.name", "").startsWith("Linux");` > If this refactoring is made, this "boolean" won't be needed as well because it can be accessed from TargetDescription. This looks reasonable enough. ------------- PR: https://git.openjdk.org/jdk/pull/10238 From aph at openjdk.org Mon Nov 7 17:01:37 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 7 Nov 2022 17:01:37 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> [v3] In-Reply-To: <6AcveZEfV2AvLEpEP-nSTG3r9aqc_S82tbDatEw1h4s=.f8b29e01-67af-4ff6-9f60-b84264bc724d@github.com> References: <6AcveZEfV2AvLEpEP-nSTG3r9aqc_S82tbDatEw1h4s=.f8b29e01-67af-4ff6-9f60-b84264bc724d@github.com> Message-ID: <5V_W5w5OM69N3cKaT5-4AK7YuXS1YPAX5w_sDxfoeeo=.ed4a54db-95b4-4b2d-a3fb-d714725bab96@github.com> On Sun, 6 Nov 2022 21:02:17 GMT, Olga Mikhaltsova wrote: >> This PR is opened as a follow-up for [1] and included the "must-done" fixes pointed by @teshull. >> >> This patch for JVMCI includes the following fixes related to the macOS AArch64 calling convention: >> 1. arguments may consume slots on the stack that are not multiples of 8 bytes [2] >> 2. natural alignment of stack arguments [2] >> 3. stack must remain 16-byte aligned [3][4] >> >> Tested with tier1 on macOS AArch64 and Linux AArch64. >> >> [1] https://github.com/openjdk/jdk/pull/6641 >> [2] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms >> [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack >> [4] https://docs.microsoft.com/en-us/cpp/build/stack-usage?view=msvc-170 > > Olga Mikhaltsova has updated the pull request incrementally with one additional commit since the last revision: > > Refactoring Looks good. Thanks. ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.org/jdk/pull/10238 From dlong at openjdk.org Mon Nov 7 22:40:48 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 7 Nov 2022 22:40:48 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v6] In-Reply-To: References: Message-ID: On Sat, 5 Nov 2022 19:27:45 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. 
> > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Need to ignore own Method when using metadata_do. src/hotspot/share/code/nmethod.cpp line 503: > 501: basic_lock_sp_offset, > 502: oop_maps); > 503: #ifdef ASSERT How about making this whole block a separate function? ------------- PR: https://git.openjdk.org/jdk/pull/10933 From dlong at openjdk.org Mon Nov 7 22:43:25 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 7 Nov 2022 22:43:25 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v6] In-Reply-To: References: Message-ID: On Sat, 5 Nov 2022 19:27:45 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Need to ignore own Method when using metadata_do. src/hotspot/share/code/nmethod.cpp line 512: > 510: nm->oops_do(&cfo); > 511: assert(!cfo.found_oop(), "no oops allowed"); > 512: CheckForMetadataClosure cfm(/* ignore reference to own Method */ nm->method()); If we are going to ignore method(), then how about checking that it's classloader is parmanent (ClassLoaderData::is_permanent_class_loader_data)? We also might need a review by a GC expert. @fisk, do you agree it is safe to put nmethod in NonNMethod space given the above checks? ------------- PR: https://git.openjdk.org/jdk/pull/10933 From vlivanov at openjdk.org Mon Nov 7 23:55:01 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 7 Nov 2022 23:55:01 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: Message-ID: <01z8uzrzUhUPTxFfIqlLbrZYjtn9xWd3Hth6rRWtbZg=.7792c2ee-f92d-4a2e-a605-973d4fd85816@github.com> On Fri, 28 Oct 2022 15:35:38 GMT, Roland Westrelin wrote: >> This change is mostly the same I sent for review 3 years ago but was >> never integrated: >> >> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html >> >> The main difference is that, in the meantime, I submitted a couple of >> refactoring changes extracted from the 2019 patch: >> >> 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr >> 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses >> >> As a result, the current patch is much smaller (but still not small). >> >> The implementation is otherwise largely the same as in the 2019 >> patch. I tried to remove some of the code duplication between the >> TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic >> shared in template methods. In the 2019 patch, interfaces were trusted >> when types were constructed and I had added code to drop interfaces >> from a type where they couldn't be trusted. This new patch proceeds >> the other way around: interfaces are not trusted when a type is >> constructed and code that uses the type must explicitly request that >> they are included (this was suggested as an improvement by Vladimir >> Ivanov I think). 
> > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > build fix BTW will there still be a need in `CastPP` vs `CheckCastPP` dichotomy once the patch goes in? ------------- PR: https://git.openjdk.org/jdk/pull/10901 From fgao at openjdk.org Tue Nov 8 02:46:11 2022 From: fgao at openjdk.org (Fei Gao) Date: Tue, 8 Nov 2022 02:46:11 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast Message-ID: For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. 
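For reference, a made-up example of a loop shape that can give SuperWord such `CMove` packs (the method and array names are invented; this is not the failing test itself):

```
class CMoveLoop {
    // The ternary may be turned into CMoveF nodes by loop opts; SuperWord then
    // collects them into CMove packs, which are only vectorizable for the
    // qualified patterns described above. Assumes a and b are at least as long as c.
    static void blend(float[] a, float[] b, float[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = (a[i] < b[i]) ? a[i] : b[i];
        }
    }
}
```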
[1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 ------------- Commit messages: - 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast Changes: https://git.openjdk.org/jdk/pull/11034/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11034&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295407 Stats: 128 lines in 3 files changed: 75 ins; 45 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/11034.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11034/head:pull/11034 PR: https://git.openjdk.org/jdk/pull/11034 From kvn at openjdk.org Tue Nov 8 03:21:32 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Nov 2022 03:21:32 GMT Subject: RFR: 8296426: x86: Narrow UseAVX and UseSSE flags In-Reply-To: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> References: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> Message-ID: <9Hh2v2R9eYze3BcSTXpOmeLtG8IXtNuGtYlohg3HuIw=.b248e5fe-132d-429a-a2db-43c23c0ce0e5@github.com> On Fri, 4 Nov 2022 20:28:57 GMT, Claes Redestad wrote: > This patch narrows down the UseAVX and UseSSE flags to their actual supported range and uses int rather than intx for their type. This avoids need for some silly casts, and surprisingly has a small beneficial effect to binary size (-4kb libjvm on linux-x64) > > This changes behavior of previously in-range values: `-XX:UseAVX=4` would emit a strongly worded warning, but with the proposed change we'll instead terminate the JVM with an error similar to `-XX:UseAVX=100`. I believe this is too trivial for a CSR, since it only changes behavior for unsupported values. You need to fix these flags type in JVMCI too. JVMCI tests failed. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10997 From kvn at openjdk.org Tue Nov 8 03:23:27 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Nov 2022 03:23:27 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong [v2] In-Reply-To: References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: On Mon, 7 Nov 2022 11:47:51 GMT, Christian Hagedorn wrote: >> We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): >> >> assert(real_LCA != NULL, "must always find an LCA" >> ``` >> The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: >> - I limited the node dump to idx + node name to reduce the noise which made it hard to read. >> - Reversed the idom chain dumps to reflect the graph structure. >> >> Example output: >> >> Bad graph detected in build_loop_late >> n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) >> >> [... same output as before ...] 
>> >> idoms of early "197 IfFalse": >> idom[2]: 42 If >> idom[1]: 44 IfTrue >> idom[0]: 196 If >> n: 197 IfFalse >> >> idoms of (wrong) LCA "205 IfTrue": >> idom[4]: 42 If >> idom[3]: 37 Region >> idom[2]: 73 If >> idom[1]: 83 IfTrue >> idom[0]: 204 If >> n: 205 IfTrue >> >> Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): >> 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) >> >> Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix optimized build Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11015
From eosterlund at openjdk.org Tue Nov 8 07:12:30 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Tue, 8 Nov 2022 07:12:30 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v6] In-Reply-To: References: Message-ID: On Mon, 7 Nov 2022 22:41:02 GMT, Dean Long wrote: >> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: >> >> Need to ignore own Method when using metadata_do. > > src/hotspot/share/code/nmethod.cpp line 512: > >> 510: nm->oops_do(&cfo); >> 511: assert(!cfo.found_oop(), "no oops allowed"); >> 512: CheckForMetadataClosure cfm(/* ignore reference to own Method */ nm->method()); > > If we are going to ignore method(), then how about checking that it's classloader is parmanent (ClassLoaderData::is_permanent_class_loader_data)? We also might need a review by a GC expert. @fisk, do you agree it is safe to put nmethod in NonNMethod space given the above checks? Hmm. The GC used to walk all nmethods and ensure the is_unloading_state is computed for all of them. When using a STW collector, we will crash if trying to compute this outside of the UnloadingScope, as we then don't have enough information to be able to compute it. So I suppose this is a bit fragile; if an nmethod is acquired from outside of an nmethod iterator, and the is_unloading() question is asked, we risk crashing. I suppose we could strengthen this by letting is_unloading() know that the answer is false for method handle intrinsics. It feels more and more like method handle intrinsics shouldn't be nmethods as more and more code that deals with nmethods needs to know that this isn't really an nmethod. Although that's not a new observation for this patch. ------------- PR: https://git.openjdk.org/jdk/pull/10933
From chagedorn at openjdk.org Tue Nov 8 09:38:28 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Nov 2022 09:38:28 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong [v2] In-Reply-To: References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: <5y-k7DE7H3o6XTCn107ZlfsCou3qM8kYEbR1dWFptvA=.353432dc-de72-4d6c-9d30-685dbc78063f@github.com> On Mon, 7 Nov 2022 11:47:51 GMT, Christian Hagedorn wrote: >> We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): >> >> assert(real_LCA != NULL, "must always find an LCA" >> ``` >> The algorithm is not correct as we should always find an LCA of two nodes.
To fix this, I've re-implemented the algorithm and improved the dumped idom chains: >> - I limited the node dump to idx + node name to reduce the noise which made it hard to read. >> - Reversed the idom chain dumps to reflect the graph structure. >> >> Example output: >> >> Bad graph detected in build_loop_late >> n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) >> >> [... same output as before ...] >> >> idoms of early "197 IfFalse": >> idom[2]: 42 If >> idom[1]: 44 IfTrue >> idom[0]: 196 If >> n: 197 IfFalse >> >> idoms of (wrong) LCA "205 IfTrue": >> idom[4]: 42 If >> idom[3]: 37 Region >> idom[2]: 73 If >> idom[1]: 83 IfTrue >> idom[0]: 204 If >> n: 205 IfTrue >> >> Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): >> 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) >> >> Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix optimized build Thanks Vladimir for your review! ------------- PR: https://git.openjdk.org/jdk/pull/11015 From redestad at openjdk.org Tue Nov 8 09:53:52 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 8 Nov 2022 09:53:52 GMT Subject: RFR: 8296426: x86: Narrow UseAVX and UseSSE flags [v2] In-Reply-To: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> References: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> Message-ID: > This patch narrows down the UseAVX and UseSSE flags to their actual supported range and uses int rather than intx for their type. This avoids need for some silly casts, and surprisingly has a small beneficial effect to binary size (-4kb libjvm on linux-x64) > > This changes behavior of previously in-range values: `-XX:UseAVX=4` would emit a strongly worded warning, but with the proposed change we'll instead terminate the JVM with an error similar to `-XX:UseAVX=100`. I believe this is too trivial for a CSR, since it only changes behavior for unsupported values. 
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Update JVMCI flag mapping ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10997/files - new: https://git.openjdk.org/jdk/pull/10997/files/38939310..104b2234 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10997&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10997&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10997.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10997/head:pull/10997 PR: https://git.openjdk.org/jdk/pull/10997 From rcastanedalo at openjdk.org Tue Nov 8 10:08:28 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 8 Nov 2022 10:08:28 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong [v2] In-Reply-To: References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: On Mon, 7 Nov 2022 11:47:51 GMT, Christian Hagedorn wrote: >> We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): >> >> assert(real_LCA != NULL, "must always find an LCA" >> ``` >> The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: >> - I limited the node dump to idx + node name to reduce the noise which made it hard to read. >> - Reversed the idom chain dumps to reflect the graph structure. >> >> Example output: >> >> Bad graph detected in build_loop_late >> n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) >> >> [... same output as before ...] >> >> idoms of early "197 IfFalse": >> idom[2]: 42 If >> idom[1]: 44 IfTrue >> idom[0]: 196 If >> n: 197 IfFalse >> >> idoms of (wrong) LCA "205 IfTrue": >> idom[4]: 42 If >> idom[3]: 37 Region >> idom[2]: 73 If >> idom[1]: 83 IfTrue >> idom[0]: 204 If >> n: 205 IfTrue >> >> Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): >> 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) >> >> Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix optimized build Looks generally good, I just have a suggestion. Since performance is not a concern here, finding the real LCA in `RealLCA::compute_and_dump()` could be (perhaps) simplified by collecting the idom lists of `_early` (already done) and `_wrong_lca` and traversing them simultaneously from the root until a divergence is found. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11015 From chagedorn at openjdk.org Tue Nov 8 10:10:37 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Nov 2022 10:10:37 GMT Subject: RFR: 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes Message-ID: There are currently two problems with `IRNode.ALLOC*` regexes: 1. On PPC64, we do not account for an `LI` instruction which matches the array size. 
As a result, we could miss some array allocations with the `ALLOC_ARRAY*` regexes: 2e4 LD R3, offset, R3 // load ptr precise [java/lang/Object: 0x0000200058006e40 *: :Constant:exact * from TOC (lo) 2e8 STD R17, [R1_SP + #104+0] // spill copy 2ec LI R4, #1 <------- we only look for LGHI here which is specific to s390 while LI is used for PPC64 2f0 CALL,static 0x00002000177cd300 // ==> wrapper for: _new_array_Java This was revealed by a new test added by [JDK-8280378](https://bugs.openjdk.org/browse/JDK-8280378) but was already a problem before this change. 2. The newly added `IRNode.ALLOC*` regexes in JDK-8280378 which can be matched on the independent ideal compile phases by using the name of the IR node "Allocate" also matches "AllocateArray" (substring match). This is unexpected. I've changed this by matching "Allocate" exactly. I've additionally removed the matching of `LI` and `LGHI` for the `ALLOC` regexes on normal objects as we do not have an array size. I think it's safe to remove these (might need some additional testing on PPC64/s390). Thanks @TheRealMDoerr for helping to test the initial fix on PPC64! Thanks, Christian ------------- Commit messages: - 8296243: [IR Framework] IRNode.ALLOC_ARRAY* PrintOptoAssembly regex does not work properly for PPC64 Changes: https://git.openjdk.org/jdk/pull/11037/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11037&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296243 Stats: 22 lines in 2 files changed: 12 ins; 2 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/11037.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11037/head:pull/11037 PR: https://git.openjdk.org/jdk/pull/11037 From redestad at openjdk.org Tue Nov 8 10:35:28 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 8 Nov 2022 10:35:28 GMT Subject: RFR: 8296426: x86: Narrow UseAVX and UseSSE flags [v2] In-Reply-To: <9Hh2v2R9eYze3BcSTXpOmeLtG8IXtNuGtYlohg3HuIw=.b248e5fe-132d-429a-a2db-43c23c0ce0e5@github.com> References: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> <9Hh2v2R9eYze3BcSTXpOmeLtG8IXtNuGtYlohg3HuIw=.b248e5fe-132d-429a-a2db-43c23c0ce0e5@github.com> Message-ID: On Tue, 8 Nov 2022 03:17:40 GMT, Vladimir Kozlov wrote: > You need to fix these flags type in JVMCI too. JVMCI tests failed. Fixed. Interestingly the only use of these that I could find assume the flags are `int` already (perhaps a misunderstanding of `intx`): https://github.com/openjdk/jdk/blob/master/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot.amd64/src/jdk/vm/ci/hotspot/amd64/AMD64HotSpotVMConfig.java#L45 ------------- PR: https://git.openjdk.org/jdk/pull/10997 From roland at openjdk.org Tue Nov 8 14:06:38 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 8 Nov 2022 14:06:38 GMT Subject: RFR: 8294217: Assertion failure: parsing found no loops but there are some [v2] In-Reply-To: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> References: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> Message-ID: > This was reported on 11 and is not reproducible with the current > jdk. The reason is that the PhaseIdealLoop invocation before EA was > changed from LoopOptsNone to LoopOptsMaxUnroll. In the absence of > loops, LoopOptsMaxUnroll exits earlier than LoopOptsNone. That wasn't > intended and this patch makes sure they behave the same. Once that's > changed, the crash reproduces with the current jdk. 
> > The assert fires because PhaseIdealLoop::only_has_infinite_loops() > returns false even though the IR only has infinite loops. There's a > single loop nest and the inner most loop is an infinite loop. The > current logic only looks at loops that are direct children of the root > of the loop tree. It's not the first bug where > PhaseIdealLoop::only_has_infinite_loops() fails to catch an infinite > loop (8257574 was the previous one) and it's proving challenging to > have PhaseIdealLoop::only_has_infinite_loops() handle corner cases > robustly. I reworked PhaseIdealLoop::only_has_infinite_loops() once > more. This time it goes over all children of the root of the loop > tree, collects all controls for the loop and its inner loop. It then > checks whether any control is a branch out of the loop and if it is > whether it's not a NeverBranch. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10904/files - new: https://git.openjdk.org/jdk/pull/10904/files/3838eee1..1d2a7ef7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10904&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10904&range=00-01 Stats: 14 lines in 2 files changed: 5 ins; 3 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10904.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10904/head:pull/10904 PR: https://git.openjdk.org/jdk/pull/10904 From roland at openjdk.org Tue Nov 8 14:08:26 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 8 Nov 2022 14:08:26 GMT Subject: RFR: 8294217: Assertion failure: parsing found no loops but there are some [v2] In-Reply-To: References: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> Message-ID: On Mon, 31 Oct 2022 07:51:29 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > That looks reasonable to me. @chhagedorn thanks for reviewing! I pushed a new commit that should address your comments. @TobiHartmann thanks for reviewing! ------------- PR: https://git.openjdk.org/jdk/pull/10904 From redestad at openjdk.org Tue Nov 8 14:27:32 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 8 Nov 2022 14:27:32 GMT Subject: RFR: 8296426: x86: Narrow UseAVX and UseSSE flags [v3] In-Reply-To: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> References: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> Message-ID: > This patch narrows down the UseAVX and UseSSE flags to their actual supported range and uses int rather than intx for their type. This avoids need for some silly casts, and surprisingly has a small beneficial effect to binary size (-4kb libjvm on linux-x64) > > This changes behavior of previously in-range values: `-XX:UseAVX=4` would emit a strongly worded warning, but with the proposed change we'll instead terminate the JVM with an error similar to `-XX:UseAVX=100`. I believe this is too trivial for a CSR, since it only changes behavior for unsupported values. 
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Use getIntVMFlag WB API in compiler/floatingpoint/NaNTest.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10997/files - new: https://git.openjdk.org/jdk/pull/10997/files/104b2234..781b4abd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10997&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10997&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10997.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10997/head:pull/10997 PR: https://git.openjdk.org/jdk/pull/10997 From chagedorn at openjdk.org Tue Nov 8 15:33:27 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Nov 2022 15:33:27 GMT Subject: RFR: 8294217: Assertion failure: parsing found no loops but there are some [v2] In-Reply-To: References: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> Message-ID: On Tue, 8 Nov 2022 14:06:38 GMT, Roland Westrelin wrote: >> This was reported on 11 and is not reproducible with the current >> jdk. The reason is that the PhaseIdealLoop invocation before EA was >> changed from LoopOptsNone to LoopOptsMaxUnroll. In the absence of >> loops, LoopOptsMaxUnroll exits earlier than LoopOptsNone. That wasn't >> intended and this patch makes sure they behave the same. Once that's >> changed, the crash reproduces with the current jdk. >> >> The assert fires because PhaseIdealLoop::only_has_infinite_loops() >> returns false even though the IR only has infinite loops. There's a >> single loop nest and the inner most loop is an infinite loop. The >> current logic only looks at loops that are direct children of the root >> of the loop tree. It's not the first bug where >> PhaseIdealLoop::only_has_infinite_loops() fails to catch an infinite >> loop (8257574 was the previous one) and it's proving challenging to >> have PhaseIdealLoop::only_has_infinite_loops() handle corner cases >> robustly. I reworked PhaseIdealLoop::only_has_infinite_loops() once >> more. This time it goes over all children of the root of the loop >> tree, collects all controls for the loop and its inner loop. It then >> checks whether any control is a branch out of the loop and if it is >> whether it's not a NeverBranch. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Looks good, thanks for doing the updates! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10904 From kvn at openjdk.org Tue Nov 8 15:34:36 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Nov 2022 15:34:36 GMT Subject: RFR: 8296426: x86: Narrow UseAVX and UseSSE flags [v3] In-Reply-To: References: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> Message-ID: On Tue, 8 Nov 2022 14:27:32 GMT, Claes Redestad wrote: >> This patch narrows down the UseAVX and UseSSE flags to their actual supported range and uses int rather than intx for their type. This avoids need for some silly casts, and surprisingly has a small beneficial effect to binary size (-4kb libjvm on linux-x64) >> >> This changes behavior of previously in-range values: `-XX:UseAVX=4` would emit a strongly worded warning, but with the proposed change we'll instead terminate the JVM with an error similar to `-XX:UseAVX=100`. 
I believe this is too trivial for a CSR, since it only changes behavior for unsupported values. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Use getIntVMFlag WB API in compiler/floatingpoint/NaNTest.java Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10997 From roland at openjdk.org Tue Nov 8 15:34:37 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 8 Nov 2022 15:34:37 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v2] In-Reply-To: <01z8uzrzUhUPTxFfIqlLbrZYjtn9xWd3Hth6rRWtbZg=.7792c2ee-f92d-4a2e-a605-973d4fd85816@github.com> References: <01z8uzrzUhUPTxFfIqlLbrZYjtn9xWd3Hth6rRWtbZg=.7792c2ee-f92d-4a2e-a605-973d4fd85816@github.com> Message-ID: On Mon, 7 Nov 2022 23:51:44 GMT, Vladimir Ivanov wrote: > BTW will there still be a need in `CastPP` vs `CheckCastPP` dichotomy once the patch goes in? In principle, probably not. > src/hotspot/share/ci/ciObjectFactory.cpp line 160: > >> 158: InstanceKlass* ik = vmClasses::name(); \ >> 159: ciEnv::_##name = get_metadata(ik)->as_instance_klass(); \ >> 160: Array* interfaces = ik->transitive_interfaces(); \ > > What's the purpose of interface-related part of the code? ciInstanceKlass objects for the vm classes all need to be allocated from the same long lived arena that's created in ciObjectFactory::initialize() because they are shared between compilations. Without that code, the ciInstanceKlass for a particular vm class is in the long lived arena but the ciInstanceKlass objects for the interfaces are created later when they are needed in the arena of some compilation. Once that compilation is over the interface objects are destroyed but still referenced from shared types such as TypeInstPtr::MIRROR. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From mdoerr at openjdk.org Tue Nov 8 17:23:25 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 8 Nov 2022 17:23:25 GMT Subject: RFR: 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes In-Reply-To: References: Message-ID: <7yENZc-AZsVyRl1S8tmZx9XNzHSGbaY5N2O68uOUGzY=.071ff70b-1086-4352-81cd-a108f7428821@github.com> On Tue, 8 Nov 2022 09:58:42 GMT, Christian Hagedorn wrote: > There are currently two problems with `IRNode.ALLOC*` regexes: > 1. On PPC64, we do not account for an `LI` instruction which matches the array size. As a result, we could miss some array allocations with the `ALLOC_ARRAY*` regexes: > > 2e4 LD R3, offset, R3 // load ptr precise [java/lang/Object: > 0x0000200058006e40 *: :Constant:exact * from TOC (lo) > 2e8 STD R17, [R1_SP + #104+0] // spill copy > 2ec LI R4, #1 <------- we only look for LGHI here which is specific to s390 while LI is used for PPC64 > 2f0 CALL,static 0x00002000177cd300 // ==> wrapper for: _new_array_Java > > This was revealed by a new test added by [JDK-8280378](https://bugs.openjdk.org/browse/JDK-8280378) but was already a problem before this change. > > 2. The newly added `IRNode.ALLOC*` regexes in JDK-8280378 which can be matched on the independent ideal compile phases by using the name of the IR node "Allocate" also matches "AllocateArray" (substring match). This is unexpected. I've changed this by matching "Allocate" exactly. > > I've additionally removed the matching of `LI` and `LGHI` for the `ALLOC` regexes on normal objects as we do not have an array size. I think it's safe to remove these (might need some additional testing on PPC64/s390). 
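To make the substring problem concrete outside the framework, here is a small standalone sketch; the node text and patterns below are made up for illustration and are not the framework's actual regexes:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
      // A printed IR node line, made up for this example.
      std::string line = "  123  AllocateArray  === ...";
      std::cout << std::boolalpha;
      // A plain substring pattern "Allocate" also hits "AllocateArray":
      std::cout << std::regex_search(line, std::regex("Allocate")) << "\n";        // true
      // Requiring a word boundary after the name no longer matches "AllocateArray":
      std::cout << std::regex_search(line, std::regex("\\bAllocate\\b")) << "\n";  // false
      return 0;
    }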
> > Thanks @TheRealMDoerr for helping to test the initial fix on PPC64! > > Thanks, > Christian LGTM, but I'm not very familiar with the IR framework. The test `TestPhaseIRMatching.java` has passed on PPC64 and s390. ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.org/jdk/pull/11037 From mdoerr at openjdk.org Tue Nov 8 17:55:02 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 8 Nov 2022 17:55:02 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v6] In-Reply-To: References: Message-ID: On Mon, 7 Nov 2022 22:36:55 GMT, Dean Long wrote: >> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: >> >> Need to ignore own Method when using metadata_do. > > src/hotspot/share/code/nmethod.cpp line 503: > >> 501: basic_lock_sp_offset, >> 502: oop_maps); >> 503: #ifdef ASSERT > > How about making this whole block a separate function? Done. And thanks for the hint regarding `is_permanent_class_loader_data`. That was what I was missing. Assertion added. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Tue Nov 8 17:54:58 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 8 Nov 2022 17:54:58 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v7] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Refactor and add assert that Method's class is permanent. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/860b71ed..8d12f754 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=05-06 Stats: 35 lines in 1 file changed: 20 ins; 13 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Tue Nov 8 17:57:56 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 8 Nov 2022 17:57:56 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v8] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Fix typo in comment. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/8d12f754..525d9a81 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=06-07 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933
From mdoerr at openjdk.org Tue Nov 8 18:46:25 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 8 Nov 2022 18:46:25 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v6] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 07:08:47 GMT, Erik Österlund wrote: >> src/hotspot/share/code/nmethod.cpp line 512: >> >>> 510: nm->oops_do(&cfo); >>> 511: assert(!cfo.found_oop(), "no oops allowed"); >>> 512: CheckForMetadataClosure cfm(/* ignore reference to own Method */ nm->method()); >> >> If we are going to ignore method(), then how about checking that its class loader is permanent (ClassLoaderData::is_permanent_class_loader_data)? We also might need a review by a GC expert. @fisk, do you agree it is safe to put nmethod in NonNMethod space given the above checks? > > Hmm. The GC used to walk all nmethods and ensure the is_unloading_state is computed for all of them. When using a STW collector, we will crash if trying to compute this outside of the UnloadingScope, as we then don't have enough information to be able to compute it. So I suppose this is a bit fragile; if an nmethod is acquired from outside of an nmethod iterator, and the is_unloading() question is asked, we risk crashing. > I suppose we could strengthen this by letting is_unloading() know that the answer is false for method handle intrinsics. > It feels more and more like method handle intrinsics shouldn't be nmethods as more and more code that deals with nmethods needs to know that this isn't really an nmethod. Although that's not a new observation for this patch. So, calling `IsUnloadingBehaviour::is_unloading` is unsafe. Should we only call it when `is_permanent_class_loader_data()` for the method returns false? ------------- PR: https://git.openjdk.org/jdk/pull/10933
From dlong at openjdk.org Tue Nov 8 19:38:25 2022 From: dlong at openjdk.org (Dean Long) Date: Tue, 8 Nov 2022 19:38:25 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v8] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 17:57:56 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo in comment. It sounds like using nmethods for compiled method handle intrinsics isn't necessary and causes problems. What if we store them as MethodHandlesAdapterBlobs like we do for the interpreter? Lookup would need to go through method->adapter() instead of method->code(). Is there already an RFE filed for this?
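For context, the debug-only check quoted earlier in this thread — a CheckForMetadataClosure constructed with nm->method() — can be pictured roughly as follows. This is a sketch of the idea only, not the code from the patch:

    // Sketch only (not the patch): a MetadataClosure that asserts an nmethod
    // references no metadata besides its own Method*. The idea discussed above
    // is that an nmethod placed in the NonNMethod segment must not keep any
    // other metadata alive, since it is not visited by class unloading.
    class CheckForMetadataClosure : public MetadataClosure {
      Metadata* _ignored;  // the nmethod's own Method*, passed in at construction
     public:
      CheckForMetadataClosure(Metadata* ignored) : _ignored(ignored) {}
      virtual void do_metadata(Metadata* md) {
        assert(md == _ignored, "NonNMethod nmethods must not reference other metadata");
      }
    };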
------------- PR: https://git.openjdk.org/jdk/pull/10933 From dlong at openjdk.org Tue Nov 8 19:55:28 2022 From: dlong at openjdk.org (Dean Long) Date: Tue, 8 Nov 2022 19:55:28 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v6] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 18:42:32 GMT, Martin Doerr wrote: >> Hmm. The GC used to walk all nmethods an ensure the is_unloading_state is computed for all of them. When using a STW collector, we will crash if trying to compute this outside of the UnloadingScope, as we then don't have rnough information to be able to compute it. So I suppose this is a bit fragile; if an nmethod is acquired from outside of an nmethod iterator, and the is_unloading() question is asked, we risk crashing. >> I suppose we could strengthen this by letting is_unloading() know rhat the answer is false for method handle intrinsics. >> It feels more and more like method handle intrinsics shouldn't be nmethods as more and more code that deals with nmethods needs to know that this isn't really an nmethod. Although that's not a new observation for this patch. > > So, calling `IsUnloadingBehaviour::is_unloading` is unsafe. Should we only call it when `is_permanent_class_loader_data()` for the method returns false? Does this also mean that nmethods for method handle intrinsics could previously get unloaded when they became "cold", and now if they are in NonNMethod they live forever? Someone could probably write a test that fills up NonNMethod with method handle intrinsics for various signatures by spinning up temporary classes with random signatures. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From duke at openjdk.org Tue Nov 8 21:41:58 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 8 Nov 2022 21:41:58 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v8] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 
56.147 ops/s Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits: - make UsePolyIntrinsics option diagnostic - Merge remote-tracking branch 'origin/master' into avx512-poly - iwanowww review - Merge remote-tracking branch 'origin/master' into avx512-poly - address Jamil's review - invalidkeyexception and some review comments - extra whitespace character - assembler checks and test case fixes - Merge remote-tracking branch 'origin/master' into avx512-poly - Merge remote-tracking branch 'origin' into avx512-poly - ... and 5 more: https://git.openjdk.org/jdk/compare/0ee25de7...120247d5 ------------- Changes: https://git.openjdk.org/jdk/pull/10582/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=07 Stats: 1814 lines in 32 files changed: 1777 ins; 3 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 8 21:42:03 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 8 Nov 2022 21:42:03 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v7] In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 17:25:16 GMT, Volodymyr Paprotski wrote: >> src/hotspot/share/opto/library_call.cpp line 7036: >> >>> 7034: assert(r_start, "r array is NULL"); >>> 7035: >>> 7036: Node* call = make_runtime_call(RC_LEAF, >> >> Can we safely change this to `RC_LEAF | RC_NO_FP`? For the ChaCha20 block intrinsic I'm working on I've been using that parameter because I'm not touching the FP registers and that looks to be the case here (though your intrinsic is a lot more complicated than mine so I may have missed something). I believe the GHASH and AES library call routines also call `make_runtime_call()` in this way. > > Makes sense to me, will put it in and re-test (no fp registers anywhere in the intrinsic). Thanks! done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 8 21:42:08 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 8 Nov 2022 21:42:08 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v6] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 23:21:57 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> invalidkeyexception and some review comments > > src/hotspot/share/runtime/globals.hpp line 241: > >> 239: "Use intrinsics for java.util.Base64") \ >> 240: \ >> 241: product(bool, UsePolyIntrinsics, false, \ > > I'm not a fan of introducing new flags for individual intrinsics (there's already `-XX:DisableIntrinsic=_name` specifically for that), but since we already have many, shouldn't it be declared as a diagnostic flag, at least? Started removing the option, but its quite convenient to have the boolean global, so just made the option diagnostic. 
"done" ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 8 22:03:20 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 8 Nov 2022 22:03:20 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v6] In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 23:49:17 GMT, Vladimir Ivanov wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2002: >> >>> 2000: } >>> 2001: >>> 2002: address StubGenerator::generate_poly1305_masksCP() { >> >> I suggest to turn it into a C++ literal constant and move the declaration next to `poly1305_process_blocks_avx512` where they are used. As an example, here's how it is handled in GHASH stubs: >> https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64_ghash.cpp#L35 >> >> That would allow to avoid to simplify the code a bit (no need in `StubRoutines::x86::_poly1305_mask_addr`/`poly1305_mask_addr()` and no need to generate the constants during VM startup). >> >> You could split it into 3 constants, but then using a single base register (`polyCP`) won't work anymore. >> Thinking more about it, I'm not sure why you can't just do the split and use address literals instead to access individual constants (and repurpose `r13` to be used as a scratch register when RIP-relative addressing mode doesn't work). > > The case of AES stubs may be even a better fit here: > https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp#L47 > > It doesn't use/introduce any shared constants, so declaring a constant and a local accessor (to save on pointer to address casts at use sites) is enough. @iwanowww moved to StubGenerator as suggested.. moving functions to the stubGenerator_x86_64.hpp header doesn't seem 'clean' but I think that's the pattern. The constant pool.. stared at it for a while and ended up keeping it mostly intact (its now a static function, not a member function; header bit cleaner; followed AES pattern). Did not split it up into individual constants. The main 'problem' is that `Address` and `ExternalAddress` are not compatible. Most instructions do not take `AddressLiteral`, so can't use `ExternalAddress` to refer to those constants. (If I did get the instructions I use to take `AddressLiteral`, I think we would end up with more `lea(rscratch)`s generated; but that's more of a silver-lining) I also thought of loading constants at run-time, (load and replicate for vector.. what I mentioned in my comment above) but that seems needlessly complicated in hindsight.. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From shade at openjdk.org Tue Nov 8 22:12:22 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Nov 2022 22:12:22 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations Message-ID: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> If you look at generated code for the JMH benchmark like: public class ArrayRead { @Param({"1", "100", "10000", "1000000"}) int size; int[] is; @Setup public void setup() { is = new int[size]; for (int c = 0; c < size; c++) { is[c] = c; } } @Benchmark public void test(Blackhole bh) { for (int i = 0; i < is.length; i++) { bh.consume(is[i]); } } } ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. 
This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. Motivational improvements on the test above: Benchmark (size) Mode Cnt Score Error Units # Before, full Java blackholes ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op # Before, compiler blackholes ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op # After, compiler blackholes ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better `-prof perfasm` shows the reason for these improvements clearly: Before: ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f ? 0x00007f0b5449836a: shl $0x3,%r10 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 After: ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 ? 0x00007fa06c49a8dc: cmp %esi,%r10d 0.07% ? 
0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 Additional testing: - [x] Eyeballing JMH Samples `-prof perfasm` - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` - [x] Linux x86_64 fastdebug, JDK benchmark corpus ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/11041/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296545 Stats: 128 lines in 3 files changed: 127 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11041.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11041/head:pull/11041 PR: https://git.openjdk.org/jdk/pull/11041 From adinn at openjdk.org Tue Nov 8 22:28:26 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 8 Nov 2022 22:28:26 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: <5LT__RwLkAKkf7VjO_I5N9P60q_PGY_7MAXeZR3AMkA=.6141f468-44c9-4f13-8521-34b54a93b466@github.com> On Tue, 8 Nov 2022 15:48:01 GMT, Aleksey Shipilev wrote: > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. > > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. > > We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. > > Motivational improvements on the test above: > > > Benchmark (size) Mode Cnt Score Error Units > > # Before, full Java blackholes > ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op > ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op > ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op > ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op > > # Before, compiler blackholes > ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op > ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op > ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op > ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op > > # After, compiler blackholes > ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better > ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better > ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better > ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better > > > `-prof perfasm` shows the reason for these improvements clearly: > > Before: > > > ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 > 7.97% ? 
0x00007f0b54498365: cmp %edx,%r11d > 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f > ? 0x00007f0b5449836a: shl $0x3,%r10 > 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" > 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" > 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 > 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ > 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 > 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check > 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d > 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 > > > After: > > > > ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read > 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 > ? 0x00007fa06c49a8dc: cmp %esi,%r10d > 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 > > > Additional testing: > - [x] Eyeballing JMH Samples `-prof perfasm` > - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` > - [x] Linux x86_64 fastdebug, JDK benchmark corpus Ooh, nice work. Unfortunately the VerifyGraphEdges test seems to be failing. ------------- PR: https://git.openjdk.org/jdk/pull/11041 From duke at openjdk.org Tue Nov 8 23:21:58 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 8 Nov 2022 23:21:58 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v9] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 
56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: fix 32-bit build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/120247d5..da560452 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=07-08 Stats: 0 lines in 1 file changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From redestad at openjdk.org Tue Nov 8 23:42:26 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 8 Nov 2022 23:42:26 GMT Subject: RFR: 8296426: x86: Narrow UseAVX and UseSSE flags [v3] In-Reply-To: References: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> Message-ID: On Tue, 8 Nov 2022 14:27:32 GMT, Claes Redestad wrote: >> This patch narrows down the UseAVX and UseSSE flags to their actual supported range and uses int rather than intx for their type. This avoids need for some silly casts, and surprisingly has a small beneficial effect to binary size (-4kb libjvm on linux-x64) >> >> This changes behavior of previously in-range values: `-XX:UseAVX=4` would emit a strongly worded warning, but with the proposed change we'll instead terminate the JVM with an error similar to `-XX:UseAVX=100`. I believe this is too trivial for a CSR, since it only changes behavior for unsupported values. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Use getIntVMFlag WB API in compiler/floatingpoint/NaNTest.java Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10997 From redestad at openjdk.org Tue Nov 8 23:44:30 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 8 Nov 2022 23:44:30 GMT Subject: Integrated: 8296426: x86: Narrow UseAVX and UseSSE flags In-Reply-To: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> References: <1-Zq6sFmnAdJMwJxyqmtLQWBNsdctsQToKb6PYRyee0=.37542925-d668-487f-8344-13cc6f63325b@github.com> Message-ID: On Fri, 4 Nov 2022 20:28:57 GMT, Claes Redestad wrote: > This patch narrows down the UseAVX and UseSSE flags to their actual supported range and uses int rather than intx for their type. This avoids need for some silly casts, and surprisingly has a small beneficial effect to binary size (-4kb libjvm on linux-x64) > > This changes behavior of previously in-range values: `-XX:UseAVX=4` would emit a strongly worded warning, but with the proposed change we'll instead terminate the JVM with an error similar to `-XX:UseAVX=100`. I believe this is too trivial for a CSR, since it only changes behavior for unsupported values. This pull request has now been integrated. 
Changeset: d9b25e86 Author: Claes Redestad URL: https://git.openjdk.org/jdk/commit/d9b25e860b0d73f5fc0890c006bfad0614b23d5c Stats: 14 lines in 4 files changed: 0 ins; 0 del; 14 mod 8296426: x86: Narrow UseAVX and UseSSE flags Reviewed-by: vlivanov, kvn ------------- PR: https://git.openjdk.org/jdk/pull/10997 From kvn at openjdk.org Tue Nov 8 23:58:26 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Nov 2022 23:58:26 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Tue, 8 Nov 2022 15:48:01 GMT, Aleksey Shipilev wrote: > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. > > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. > > We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. > > Motivational improvements on the test above: > > > Benchmark (size) Mode Cnt Score Error Units > > # Before, full Java blackholes > ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op > ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op > ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op > ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op > > # Before, compiler blackholes > ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op > ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op > ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op > ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op > > # After, compiler blackholes > ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better > ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better > ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better > ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better > > > `-prof perfasm` shows the reason for these improvements clearly: > > Before: > > > ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 > 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d > 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f > ? 0x00007f0b5449836a: shl $0x3,%r10 > 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" > 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" > 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 > 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ > 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 > 53.26% ? 
0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check > 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d > 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 > > > After: > > > > ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read > 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 > ? 0x00007fa06c49a8dc: cmp %esi,%r10d > 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 > > > Additional testing: > - [x] Eyeballing JMH Samples `-prof perfasm` > - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` > - [x] Linux x86_64 fastdebug, JDK benchmark corpus Looks good. It is just copy of `insert_mem_bar()` without setting memory to BH. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11041 From eosterlund at openjdk.org Wed Nov 9 00:11:05 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Wed, 9 Nov 2022 00:11:05 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v6] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 19:53:13 GMT, Dean Long wrote: >> So, calling `IsUnloadingBehaviour::is_unloading` is unsafe. Should we only call it when `is_permanent_class_loader_data()` for the method returns false? > > Does this also mean that nmethods for method handle intrinsics could previously get unloaded when they became "cold", and now if they are in NonNMethod they live forever? Someone could probably write a test that fills up NonNMethod with method handle intrinsics for various signatures by spinning up temporary classes with random signatures. IIRC the is_cold check returns false for method handle intrinsics. Mostly because it mimicked the sweeper heuristics for coldness. Having said that, the sweeper heuristics couldn't sample method handle intrinsics because they had no activation records and hence would always appear as cold even when they were not, while in the nmethod entry barrier heuristics for coldness, we would be able to to phase out cold method handle intrinsic nmethods. But I expect the amount of memory consumed by them to be rather small, so the magnitude of the win of going in that direction is questionable. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From vlivanov at openjdk.org Wed Nov 9 00:29:32 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 9 Nov 2022 00:29:32 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v9] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 23:21:58 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. 
>> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > fix 32-bit build src/hotspot/cpu/x86/macroAssembler_x86.hpp line 970: > 968: > 969: void addmq(int disp, Register r1, Register r2); > 970: Leftover formatting changes. src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 95: > 93: > 94: // OFFSET 64: mask_44 > 95: 0xfffffffffff, 0xfffffffffff, Please, keep leading zeroes explicit in the constants. src/hotspot/cpu/x86/stubRoutines_x86.cpp line 2: > 1: /* > 2: * Copyright (c) 2013, 2022, Oracle and/or its affiliates. All rights reserved. No changes in the file anymore. src/hotspot/share/opto/library_call.cpp line 7014: > 7012: const TypeKlassPtr* rklass = TypeKlassPtr::make(instklass_ImmutableElement); > 7013: const TypeOopPtr* rtype = rklass->as_instance_type()->cast_to_ptr_type(TypePtr::NotNull); > 7014: Node* rObj = new CheckCastPPNode(control(), rFace, rtype); FTR it's an unsafe cast since it doesn't involve a runtime check from `IntegerModuloP` to `ImmutableElement`. Please, lift as much checks into Java wrapper as possible. src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175: > 173: > 174: int blockMultipleLength = len & (~(BLOCK_LENGTH-1)); > 175: Objects.checkFromIndexSize(offset, blockMultipleLength, input.length); I suggest to move the checks into `processMultipleBlocks`, introduce new static helper method specifically for the intrinsic part, and lift more logic (e.g., field loads) from the intrinsic into Java code. As an additional step, you can switch to double-register addressing mode (base + offset) for input data (`input`, `alimbs`, `rlimbs`) and simplify the intrinsic part even more (will involve a switch from `array_element_address` to `make_unsafe_address`). ------------- PR: https://git.openjdk.org/jdk/pull/10582 From kvn at openjdk.org Wed Nov 9 00:40:28 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 9 Nov 2022 00:40:28 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations In-Reply-To: <5LT__RwLkAKkf7VjO_I5N9P60q_PGY_7MAXeZR3AMkA=.6141f468-44c9-4f13-8521-34b54a93b466@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> <5LT__RwLkAKkf7VjO_I5N9P60q_PGY_7MAXeZR3AMkA=.6141f468-44c9-4f13-8521-34b54a93b466@github.com> Message-ID: On Tue, 8 Nov 2022 22:24:39 GMT, Andrew Dinn wrote: > Ooh, nice work. Unfortunately the VerifyGraphEdges test seems to be failing. 
https://bugs.openjdk.org/browse/JDK-8295867 ------------- PR: https://git.openjdk.org/jdk/pull/11041 From vlivanov at openjdk.org Wed Nov 9 00:42:32 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 9 Nov 2022 00:42:32 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v6] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 22:01:19 GMT, Volodymyr Paprotski wrote: > Did not split it up into individual constants. The main 'problem' is that Address and ExternalAddress are not compatible. There's a reason for that and it's because RIP-relative addressing doesn't always work, so additional register may be needed. > Most instructions do not take AddressLiteral, so can't use ExternalAddress to refer to those constants. I counted 4 instructions accessing the constants (`evpandq`, `andq`, `evporq`, and `vpternlogq`) in your patch. `macroAssembler_x86.hpp` is the place for `AddressLiteral`-related overloads (there are already numerous cases present) and it's trivial to add new ones. > (If I did get the instructions I use to take AddressLiteral, I think we would end up with more lea(rscratch)s generated; but that's more of a silver-lining) It depends on memory layout. If constants end up placed close enough in the address space, there'll be no additional instructions generated. Anyway, it doesn't look like something important from throughput perspective. Overall, I find it clearer when the code refers to individual constants through `AddressLiteral`s, but I'm also fine with it as it is now. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Wed Nov 9 00:47:16 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 9 Nov 2022 00:47:16 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Tue, 8 Nov 2022 15:48:01 GMT, Aleksey Shipilev wrote: > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. > > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. > > We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. > > Motivational improvements on the test above: > > > Benchmark (size) Mode Cnt Score Error Units > > # Before, full Java blackholes > ArrayRead.test 1 avgt 9 5.422 ? 
0.023 ns/op > ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op > ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op > ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op > > # Before, compiler blackholes > ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op > ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op > ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op > ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op > > # After, compiler blackholes > ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better > ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better > ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better > ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better > > > `-prof perfasm` shows the reason for these improvements clearly: > > Before: > > > ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 > 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d > 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f > ? 0x00007f0b5449836a: shl $0x3,%r10 > 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" > 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" > 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 > 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ > 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 > 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check > 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d > 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 > > > After: > > > > ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read > 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 > ? 0x00007fa06c49a8dc: cmp %esi,%r10d > 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 > > > Additional testing: > - [x] Eyeballing JMH Samples `-prof perfasm` > - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` > - [x] Linux x86_64 fastdebug, JDK benchmark corpus src/hotspot/share/opto/library_call.cpp line 7790: > 7788: MemBarNode* mb = MemBarNode::make(C, Op_Blackhole); > 7789: mb->init_req(TypeFunc::Control, control()); > 7790: mb->init_req(TypeFunc::Memory, mem); Does it need memory at all? In other words, is `Blackhole` still a `MemBar` or can it become a pure control node now? ------------- PR: https://git.openjdk.org/jdk/pull/11041 From yyang at openjdk.org Wed Nov 9 01:51:30 2022 From: yyang at openjdk.org (Yi Yang) Date: Wed, 9 Nov 2022 01:51:30 GMT Subject: RFR: 8288204: GVN Crash: assert() failed: correct memory chain [v3] In-Reply-To: References: Message-ID: On Sat, 29 Oct 2022 00:02:24 GMT, Vladimir Kozlov wrote: > `te[int:>=0]:exact+any*` to instance array type `byte[int:8]:NotNull:exact+any *,iid=177`. Mostly because it does not adjust array's parameters. This cast chain simply does not work for arra Make sense. But I guess we should cast more beyond the array parameter, i.e. 
[the offset should be cast as well?](https://github.com/openjdk/jdk/pull/9777/files/063d2468f2ae1102dc1e9bf396a29b92f1a40569) Because we have a Phi#1109(`byte[int:>=0]:exact+any*`) and AddP#473(`byte[int:8]:NotNull:exact[0] *,iid=177`) with fixed offset due to ConI#585 335 CheckCastPP === 331 461 [[... ]] #byte[int:8]:NotNull:exact *,iid=177 585 ConL === 0 [[ 473 1054 970 ]] #long:18 473 AddP === _ 335 335 585 #byte[int:8]:NotNull:exact[0] *,iid=177 ------------- PR: https://git.openjdk.org/jdk/pull/9777 From duke at openjdk.org Wed Nov 9 02:22:01 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 9 Nov 2022 02:22:01 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v6] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 00:38:45 GMT, Vladimir Ivanov wrote: >> @iwanowww moved to StubGenerator as suggested.. moving functions to the stubGenerator_x86_64.hpp header doesn't seem 'clean' but I think that's the pattern. >> >> The constant pool.. stared at it for a while and ended up keeping it mostly intact (its now a static function, not a member function; header bit cleaner; followed AES pattern). >> >> Did not split it up into individual constants. The main 'problem' is that `Address` and `ExternalAddress` are not compatible. Most instructions do not take `AddressLiteral`, so can't use `ExternalAddress` to refer to those constants. (If I did get the instructions I use to take `AddressLiteral`, I think we would end up with more `lea(rscratch)`s generated; but that's more of a silver-lining) >> >> I also thought of loading constants at run-time, (load and replicate for vector.. what I mentioned in my comment above) but that seems needlessly complicated in hindsight.. > >> Did not split it up into individual constants. The main 'problem' is that Address and ExternalAddress are not compatible. > > There's a reason for that and it's because RIP-relative addressing doesn't always work, so additional register may be needed. > >> Most instructions do not take AddressLiteral, so can't use ExternalAddress to refer to those constants. > > I counted 4 instructions accessing the constants (`evpandq`, `andq`, `evporq`, and `vpternlogq`) in your patch. > > `macroAssembler_x86.hpp` is the place for `AddressLiteral`-related overloads (there are already numerous cases present) and it's trivial to add new ones. > >> (If I did get the instructions I use to take AddressLiteral, I think we would end up with more lea(rscratch)s generated; but that's more of a silver-lining) > > It depends on memory layout. If constants end up placed close enough in the address space, there'll be no additional instructions generated. > > Anyway, it doesn't look like something important from throughput perspective. Overall, I find it clearer when the code refers to individual constants through `AddressLiteral`s, but I'm also fine with it as it is now. Makes sense to me, that would indeed be cleaner, will add a couple more overloads. (Still getting used to what is 'clean' in this code base). 
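For reference, such MacroAssembler overloads usually follow the same shape: use the RIP-relative form when the constant is reachable and otherwise materialize the address in a scratch register. A rough sketch of that pattern follows; the exact signature and helper names here are assumptions, not the eventual patch:

    // Sketch of the common AddressLiteral-overload pattern, not the actual change.
    void MacroAssembler::evpandq(XMMRegister dst, XMMRegister nds, AddressLiteral src,
                                 int vector_len, Register rscratch) {
      assert(rscratch != noreg || always_reachable(src), "missing scratch register");
      if (reachable(src)) {
        evpandq(dst, nds, as_Address(src), vector_len);      // RIP-relative access
      } else {
        lea(rscratch, src);                                  // materialize the address
        evpandq(dst, nds, Address(rscratch, 0), vector_len);
      }
    }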
------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 9 02:22:04 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 9 Nov 2022 02:22:04 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v9] In-Reply-To: References: Message-ID: <9eURte9F6DahXze39MUQEegF0nNqZRfXh-au-mRNhpA=.b145ca11-9d61-4976-aece-4da91aa2f719@github.com> On Wed, 9 Nov 2022 00:10:48 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> fix 32-bit build > > src/hotspot/share/opto/library_call.cpp line 7014: > >> 7012: const TypeKlassPtr* rklass = TypeKlassPtr::make(instklass_ImmutableElement); >> 7013: const TypeOopPtr* rtype = rklass->as_instance_type()->cast_to_ptr_type(TypePtr::NotNull); >> 7014: Node* rObj = new CheckCastPPNode(control(), rFace, rtype); > > FTR it's an unsafe cast since it doesn't involve a runtime check from `IntegerModuloP` to `ImmutableElement`. Please, lift as much checks into Java wrapper as possible. Ah, yeah.. I quite suspected I didn't emulate all the bytecodes needed. Thanks for the info. So this is a bit of a quandary.. I had done the intrinsic more in Java before, but it slows down the non-intrinsic path (This was the discussion we were having with Jamil). In Java, the limbs are not 'accessible' per-se.. They are in a separate package hidden behind and interface.. and in a nested non-static class inside an abstract class... Its quite well designed. Its just makes what I want to do break most encapsulations. There is a method (`asByteArray`) to extract the limbs that I was previously using, but that slows down non-intrinsic path (to be honest it slows down the intrinsic path too, but the assembler makes some of that back.. You can see that from the numbers I posted for Jamil, vs original in the PR header; originally I got 18x, and now with accessing limbs directly its 19x). If I had some way to check 'is intrinsic available', I could at least not slow down the current code. I would still have to break the encapsulation to do the checks/casts though. It all seems less-then-perfect. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From roland at openjdk.org Wed Nov 9 08:35:51 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 9 Nov 2022 08:35:51 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v2] In-Reply-To: <50d-zariA__s7xBrD-gRjTxc0rDpWU4xw7_VhqbMjoA=.cb87e513-43c2-4564-9617-f4443625750e@github.com> References: <50d-zariA__s7xBrD-gRjTxc0rDpWU4xw7_VhqbMjoA=.cb87e513-43c2-4564-9617-f4443625750e@github.com> Message-ID: <1j149t-ytW5ejxpv5A-YU4T-Mu3vVtL2c6MIdVzCzqw=.6a68249e-a9c3-407f-943f-10efc29069f3@github.com> On Fri, 28 Oct 2022 17:01:06 GMT, Vladimir Ivanov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> build fix > > src/hotspot/share/ci/ciInstanceKlass.cpp line 735: > >> 733: GrowableArray* result = NULL; >> 734: GUARDED_VM_ENTRY( >> 735: InstanceKlass* ik = get_instanceKlass(); > > Does it make sense to cache the result on `ciInstanceKlass` instance? Maybe but it's not as simple as it seems (for the reason discussed below for vm classes). Some classes are allocated in a special arena because they are shared between compilations. 
A _transitive_interfaces array would need to be allocated in that same arena (otherwise, if lazily allocated during a particular compilation, it would be in the thread's resource area and its content would be wiped out when the compilation is over but still reachable). That's not straightforward with the current implementation as the arena a ciInstanceKlass is allocated in is not available to the ciInstanceKlass, so that would require extra changes and I'm not sure the extra complexity is worth it. > src/hotspot/share/opto/type.cpp line 4840: > >> 4838: } >> 4839: interfaces = this_interfaces.intersection_with(tp_interfaces); >> 4840: return TypeInstPtr::make(ptr, ciEnv::current()->Object_klass(), interfaces, false, NULL,offset, instance_id, speculative, depth); > >> NULL,offset > > missing space Ok. > src/hotspot/share/opto/type.cpp line 5737: > >> 5735: // below the centerline when the superclass is exact. We need to >> 5736: // do the same here. >> 5737: if (klass()->equals(ciEnv::current()->Object_klass()) && this_interfaces.intersection_with(tp_interfaces).eq(this_interfaces) && !klass_is_exact()) { > >> this_interfaces.intersection_with(tp_interfaces).eq(this_interfaces) > > Maybe a case for a helper method `InterfaceSet::contains(InterfaceSet)`? Indeed, there are several uses of this pattern. I will make that change. > src/hotspot/share/opto/type.cpp line 5861: > >> 5859: bool klass_is_exact = ik->is_final(); >> 5860: if (!klass_is_exact && >> 5861: deps != NULL && UseUniqueSubclasses) { > > Please, put `UseUniqueSubclasses` guard at the top of the method. Ok. > src/hotspot/share/opto/type.hpp line 1154: > >> 1152: // Respects UseUniqueSubclasses. >> 1153: // If the klass is final, the resulting type will be exact. >> 1154: static const TypeOopPtr* make_from_klass(ciKlass* klass, bool trust_interface = false) { > > I'd suggest to use an enum (`trust_interfaces`/`ignore_interfaces`) instead of a `bool`, so the intention is clear at call sites. Good suggestion. I will make that change. ------------- PR: https://git.openjdk.org/jdk/pull/10901
From chagedorn at openjdk.org Wed Nov 9 10:25:39 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 9 Nov 2022 10:25:39 GMT Subject: RFR: 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 09:58:42 GMT, Christian Hagedorn wrote: > There are currently two problems with `IRNode.ALLOC*` regexes: > 1. On PPC64, we do not account for an `LI` instruction which matches the array size. As a result, we could miss some array allocations with the `ALLOC_ARRAY*` regexes: > > 2e4 LD R3, offset, R3 // load ptr precise [java/lang/Object: > 0x0000200058006e40 *: :Constant:exact * from TOC (lo) > 2e8 STD R17, [R1_SP + #104+0] // spill copy > 2ec LI R4, #1 <------- we only look for LGHI here which is specific to s390 while LI is used for PPC64 > 2f0 CALL,static 0x00002000177cd300 // ==> wrapper for: _new_array_Java > > This was revealed by a new test added by [JDK-8280378](https://bugs.openjdk.org/browse/JDK-8280378) but was already a problem before this change. > > 2. The newly added `IRNode.ALLOC*` regexes in JDK-8280378 which can be matched on the independent ideal compile phases by using the name of the IR node "Allocate" also matches "AllocateArray" (substring match). This is unexpected. I've changed this by matching "Allocate" exactly. > > I've additionally removed the matching of `LI` and `LGHI` for the `ALLOC` regexes on normal objects as we do not have an array size. 
I think it's safe to remove these (might need some additional testing on PPC64/s390). > > Thanks @TheRealMDoerr for helping to test the initial fix on PPC64! > > Thanks, > Christian Thanks Martin for your review and testing it again! ------------- PR: https://git.openjdk.org/jdk/pull/11037 From vkempik at openjdk.org Wed Nov 9 11:20:05 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Wed, 9 Nov 2022 11:20:05 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub Message-ID: Please review this change to improve the performance of copy_memory stub on risc-v ------------- Commit messages: - 8296602: RISC-V: improve performance of copy_memory stub Changes: https://git.openjdk.org/jdk/pull/11058/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11058&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296602 Stats: 63 lines in 1 file changed: 37 ins; 1 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/11058.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11058/head:pull/11058 PR: https://git.openjdk.org/jdk/pull/11058 From vkempik at openjdk.org Wed Nov 9 11:34:43 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Wed, 9 Nov 2022 11:34:43 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v2] In-Reply-To: References: Message-ID: > Please review this change to improve the performance of copy_memory stub on risc-v > > This change has three parts > 1) use copy32 if possible to do 4 ld and 4 st per loop cycle > 2) don't produce precopy code if is_aligned is true, it's not executed. > 3) in the end of loop8 and loop32, remove data dependency between two addi opcodes, to allow them to be scheduled simultaneously > > testing: org.openjdk.bench.vm.compiler.ArrayCopyObject, hotspot_compiler_arraycopy, hotspot:tier1 - all ok > hotspot:tier2 is on the way. > > and for the benchmark results, using > org.openjdk.bench.vm.compiler.ArrayCopyObject.conjoint_micro > > thead rvb-ice c910 > thead > > Before ( copy8 only ) > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6653.095 ? 251.565 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4933.970 ? 77.559 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3627.454 ? 34.589 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 368.249 ? 0.453 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 187.776 ? 0.306 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 94.477 ? 0.340 ops/ms > > after ( with copy32 ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7620.546 ? 69.756 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6677.978 ? 33.112 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5206.973 ? 22.612 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 653.655 ? 31.494 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 352.905 ? 7.390 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 165.127 ? 0.832 ops/ms > > after ( copy32 with dead code elimination and independent addis ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7576.346 ? 94.487 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6475.730 ? 252.590 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5221.764 ? 20.415 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 691.847 ? 1.102 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 360.269 ? 1.091 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 179.733 ? 3.012 ops/ms > > on hifive unmatched: > > before: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 5391.575 ? 
152.984 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 3700.946 ? 43.175 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 2316.160 ? 24.734 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 188.616 ? 0.151 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 95.323 ? 0.053 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 46.935 ? 0.041 ops/ms > > after: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6136.169 ? 330.409 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4924.020 ? 78.529 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3732.561 ? 89.606 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 431.103 ? 0.505 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 221.543 ? 0.363 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 100.586 ? 0.197 ops/ms Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: remove excessive comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11058/files - new: https://git.openjdk.org/jdk/pull/11058/files/39815db0..a788f8f2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11058&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11058&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11058.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11058/head:pull/11058 PR: https://git.openjdk.org/jdk/pull/11058 From shade at openjdk.org Wed Nov 9 11:57:45 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Nov 2022 11:57:45 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. > > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. > > We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. > > Motivational improvements on the test above: > > > Benchmark (size) Mode Cnt Score Error Units > > # Before, full Java blackholes > ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op > ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op > ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op > ArrayRead.test 1000000 avgt 9 4332723.304 ? 
2791.324 ns/op > > # Before, compiler blackholes > ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op > ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op > ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op > ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op > > # After, compiler blackholes > ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better > ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better > ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better > ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better > > > `-prof perfasm` shows the reason for these improvements clearly: > > Before: > > > ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 > 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d > 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f > ? 0x00007f0b5449836a: shl $0x3,%r10 > 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" > 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" > 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 > 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ > 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 > 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check > 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d > 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 > > > After: > > > > ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read > 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 > ? 0x00007fa06c49a8dc: cmp %esi,%r10d > 0.07% ? 
0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 > > > Additional testing: > - [x] Eyeballing JMH Samples `-prof perfasm` > - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` > - [x] Linux x86_64 fastdebug, JDK benchmark corpus Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Do not touch memory at all ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11041/files - new: https://git.openjdk.org/jdk/pull/11041/files/5a91ed9a..1ca2febe Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=00-01 Stats: 8 lines in 2 files changed: 0 ins; 5 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11041.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11041/head:pull/11041 PR: https://git.openjdk.org/jdk/pull/11041 From shade at openjdk.org Wed Nov 9 11:59:20 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Nov 2022 11:59:20 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Wed, 9 Nov 2022 00:43:15 GMT, Vladimir Ivanov wrote: >> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: >> >> Do not touch memory at all > > src/hotspot/share/opto/library_call.cpp line 7790: > >> 7788: MemBarNode* mb = MemBarNode::make(C, Op_Blackhole); >> 7789: mb->init_req(TypeFunc::Control, control()); >> 7790: mb->init_req(TypeFunc::Memory, mem); > > Does it need memory at all? In other words, is `Blackhole` still a `MemBar` or can it become a pure control node now? That's a good question. I don't think it needs memory. I disconnected the input memory in new commit as well, and all tests seem fine. This also allows to simplify anti-dependence logic, as `Blackhole` does not have the asymmetry of "takes memory, but does not produce it" anymore. AFAICS, `Blackhole` being a subclass of `MemBar` helps to avoid additional checks in other places in compiler, where we can test for `is_MemBar`. I can see if we can move `Blackhole` out of `MemBar` class without messing things up. ------------- PR: https://git.openjdk.org/jdk/pull/11041 From dsamersoff at openjdk.org Wed Nov 9 12:49:05 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Wed, 9 Nov 2022 12:49:05 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 Message-ID: In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. 
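For illustration, here is a minimal sketch of that single-store idea (this is not the code in the PR; the helper name and the use of std::atomic are assumptions for the sketch, where the real code would use the VM's own Atomic and ICache facilities):

#include <atomic>
#include <cstdint>
#include <cstring>

// Build the 5-byte "jmp rel32" inside an 8-byte word that still carries the
// current bytes 5..7 of the entry, then publish the whole word with a single
// 64-bit store, so a concurrently executing core never fetches a half-patched
// instruction. Assumes verified_entry is 8-byte aligned.
static void patch_verified_entry_sketch(unsigned char* verified_entry,
                                        unsigned char* dest) {
  union { uint64_t word; unsigned char bytes[8]; } patch;
  std::memcpy(patch.bytes, verified_entry, sizeof(patch.bytes)); // keep bytes 5..7
  patch.bytes[0] = 0xE9;                                         // jmp rel32 opcode
  int32_t disp = (int32_t)(dest - (verified_entry + 5));         // relative to next insn
  std::memcpy(&patch.bytes[1], &disp, sizeof(disp));
  reinterpret_cast<std::atomic<uint64_t>*>(verified_entry)
      ->store(patch.word, std::memory_order_release);
  // instruction cache maintenance / cross-modifying-code serialization is
  // deliberately left out of this sketch
}

The point is simply that the jump opcode and its 32-bit displacement become visible in the same store, instead of in a 4-byte/1-byte/4-byte sequence of writes.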
------------- Commit messages: - JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 Changes: https://git.openjdk.org/jdk/pull/11059/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11059&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294947 Stats: 23 lines in 1 file changed: 20 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11059.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11059/head:pull/11059 PR: https://git.openjdk.org/jdk/pull/11059 From roland at openjdk.org Wed Nov 9 13:20:14 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 9 Nov 2022 13:20:14 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v2] In-Reply-To: <1j149t-ytW5ejxpv5A-YU4T-Mu3vVtL2c6MIdVzCzqw=.6a68249e-a9c3-407f-943f-10efc29069f3@github.com> References: <50d-zariA__s7xBrD-gRjTxc0rDpWU4xw7_VhqbMjoA=.cb87e513-43c2-4564-9617-f4443625750e@github.com> <1j149t-ytW5ejxpv5A-YU4T-Mu3vVtL2c6MIdVzCzqw=.6a68249e-a9c3-407f-943f-10efc29069f3@github.com> Message-ID: On Wed, 9 Nov 2022 08:31:58 GMT, Roland Westrelin wrote: >> src/hotspot/share/ci/ciInstanceKlass.cpp line 735: >> >>> 733: GrowableArray* result = NULL; >>> 734: GUARDED_VM_ENTRY( >>> 735: InstanceKlass* ik = get_instanceKlass(); >> >> Does it make sense to cache the result on `ciInstanceKlass` instance? > > Maybe but it's not as simple as it seems (for the reason discussed below for vm classes). Some classes are allocated in a special arena because they are shared between compilations. A _transitive_interfaces array would need to be allocated in that same arena (otherwise if lazily allocated during a particular compilation it would be in the thread's resource are and its content would be wiped out when the compilation is over but still reachable). That's not straightforward with the current implementation as the arena a ciInstanceKlass is allocated in is not available to the ciInstanceKlass so that would require extra changes and I'm not sure the extra complexity is worth it. scratch that. I see it's done for `_nonstatic_fields` and will do the same. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From adinn at openjdk.org Wed Nov 9 13:54:26 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 9 Nov 2022 13:54:26 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Wed, 9 Nov 2022 11:56:43 GMT, Aleksey Shipilev wrote: >> src/hotspot/share/opto/library_call.cpp line 7790: >> >>> 7788: MemBarNode* mb = MemBarNode::make(C, Op_Blackhole); >>> 7789: mb->init_req(TypeFunc::Control, control()); >>> 7790: mb->init_req(TypeFunc::Memory, mem); >> >> Does it need memory at all? In other words, is `Blackhole` still a `MemBar` or can it become a pure control node now? > > That's a good question. I don't think it needs memory. I disconnected the input memory in new commit as well, and all tests seem fine. This also allows to simplify anti-dependence logic, as `Blackhole` does not have the asymmetry of "takes memory, but does not produce it" anymore. > > AFAICS, `Blackhole` being a subclass of `MemBar` helps to avoid additional checks in other places in compiler, where we can test for `is_MemBar`. I can see if we can move `Blackhole` out of `MemBar` class without messing things up. Yeah, I also cannot see why it would need the memory input so long as it has the data dependency. 
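Roughly, the shape being discussed looks like the following sketch (not the actual commit; it only illustrates a Blackhole membar built without a memory edge, in the style of the library_call.cpp snippet quoted above, so the variable names are assumptions):

// sketch only: control plus data inputs, but no TypeFunc::Memory edge
uint nargs = callee()->arg_size();
MemBarNode* mb = MemBarNode::make(C, Op_Blackhole, Compile::AliasIdxTop);
mb->init_req(TypeFunc::Control, control());
for (uint i = 0; i < nargs; i++) {
  Node* arg = argument(i);
  if (arg != nullptr) {
    mb->add_req(arg);   // data edges keep the consumed values alive
  }
}
Node* bh = _gvn.transform(mb);
set_control(_gvn.transform(new ProjNode(bh, TypeFunc::Control)));

Dead-code elimination is still defeated because the arguments stay reachable through the blackhole's data inputs, while the memory graph is no longer pinched, which is what lets loads be optimized across it.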
The key question is whether its status as a membar is stopping things being re-ordered around it. Changing its type might immediately show a problem ... or it might not (the problem with a 'suck it and see' approach is that you need to be sure you have sucked every type of sweetie in the sweet tin). ------------- PR: https://git.openjdk.org/jdk/pull/11041 From shade at openjdk.org Wed Nov 9 14:36:22 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Nov 2022 14:36:22 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Wed, 9 Nov 2022 13:50:37 GMT, Andrew Dinn wrote: >> That's a good question. I don't think it needs memory. I disconnected the input memory in new commit as well, and all tests seem fine. This also allows to simplify anti-dependence logic, as `Blackhole` does not have the asymmetry of "takes memory, but does not produce it" anymore. >> >> AFAICS, `Blackhole` being a subclass of `MemBar` helps to avoid additional checks in other places in compiler, where we can test for `is_MemBar`. I can see if we can move `Blackhole` out of `MemBar` class without messing things up. > > Yeah, I also cannot see why it would need the memory input so long as it has the data dependency. > > The key question is whether its status as a membar is stopping things being re-ordered around it. Changing its type might immediately show a problem ... or it might not (the problem with a 'suck it and see' approach is that you need to be sure you have sucked every type of sweetie in the sweet tin). I whipped up this patch that pulls `Blackhole` from `MemBarNode` to be the more generic `MultiNode` (can probably even be `Node`, if I understand how control-output-only nodes should be defined): https://cr.openjdk.java.net/~shade/8296545/blackhole-cfg-1.patch -- it seems to "work fine" on adhoc tests. But, I am still a bit uneasy to unhook blackhole from membar, on the off-chance it matters in some non-obvious way. ------------- PR: https://git.openjdk.org/jdk/pull/11041 From roland at openjdk.org Wed Nov 9 14:47:41 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 9 Nov 2022 14:47:41 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v3] In-Reply-To: References: Message-ID: > This change is mostly the same I sent for review 3 years ago but was > never integrated: > > https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html > > The main difference is that, in the meantime, I submitted a couple of > refactoring changes extracted from the 2019 patch: > > 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr > 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses > > As a result, the current patch is much smaller (but still not small). > > The implementation is otherwise largely the same as in the 2019 > patch. I tried to remove some of the code duplication between the > TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic > shared in template methods. In the 2019 patch, interfaces were trusted > when types were constructed and I had added code to drop interfaces > from a type where they couldn't be trusted.
This new patch proceeds > the other way around: interfaces are not trusted when a type is > constructed and code that uses the type must explicitly request that > they are included (this was suggested as an improvement by Vladimir > Ivanov I think). Roland Westrelin has updated the pull request incrementally with five additional commits since the last revision: - review - review - review - review - review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10901/files - new: https://git.openjdk.org/jdk/pull/10901/files/c8927519..49d1bf3e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10901&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10901&range=01-02 Stats: 205 lines in 13 files changed: 70 ins; 10 del; 125 mod Patch: https://git.openjdk.org/jdk/pull/10901.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10901/head:pull/10901 PR: https://git.openjdk.org/jdk/pull/10901 From roland at openjdk.org Wed Nov 9 14:47:42 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 9 Nov 2022 14:47:42 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v2] In-Reply-To: <50d-zariA__s7xBrD-gRjTxc0rDpWU4xw7_VhqbMjoA=.cb87e513-43c2-4564-9617-f4443625750e@github.com> References: <50d-zariA__s7xBrD-gRjTxc0rDpWU4xw7_VhqbMjoA=.cb87e513-43c2-4564-9617-f4443625750e@github.com> Message-ID: On Sat, 29 Oct 2022 00:13:55 GMT, Vladimir Ivanov wrote: > Thanks, Roland! Overall, looks very good. Thanks for reviewing this. I pushed new commits that should address your comments/suggestions. > src/hotspot/share/opto/type.cpp line 572: > >> 570: >> 571: TypeAryPtr::_array_interfaces = new TypePtr::InterfaceSet(); >> 572: GrowableArray* array_interfaces = ciArrayKlass::interfaces(); > > Maybe move the code into a constructor or a factory method? > After that, the only user of `TypePtr::InterfaceSet::add()` will be `TypePtr::interfaces()`. > It would be nice to make `TypePtr::InterfaceSet` immutable and cache query results (`InterfaceSet::is_loaded() ` and `InterfaceSet::exact_klass()`). Good suggestion as well. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From vkempik at openjdk.org Wed Nov 9 15:57:41 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Wed, 9 Nov 2022 15:57:41 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v2] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 11:34:43 GMT, Vladimir Kempik wrote: >> Please review this change to improve the performance of copy_memory stub on risc-v >> >> This change has three parts >> 1) use copy32 if possible to do 4 ld and 4 st per loop cycle >> 2) don't produce precopy code if is_aligned is true, it's not executed. >> 3) in the end of loop8 and loop32, remove data dependency between two addi opcodes, to allow them to be scheduled simultaneously >> >> testing: org.openjdk.bench.vm.compiler.ArrayCopyObject, hotspot_compiler_arraycopy, hotspot:tier1 - all ok >> hotspot:tier2 is on the way. >> >> and for the benchmark results, using >> org.openjdk.bench.vm.compiler.ArrayCopyObject.conjoint_micro >> >> thead rvb-ice c910 >> thead >> >> Before ( copy8 only ) >> Benchmark (size) Mode Cnt Score Error Units >> ArrayCopyObject.conjoint_micro 31 thrpt 25 6653.095 ? 251.565 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 4933.970 ? 77.559 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 3627.454 ? 34.589 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 368.249 ? 
0.453 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 187.776 ? 0.306 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 94.477 ? 0.340 ops/ms >> >> after ( with copy32 ) >> ArrayCopyObject.conjoint_micro 31 thrpt 25 7620.546 ? 69.756 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 6677.978 ? 33.112 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 5206.973 ? 22.612 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 653.655 ? 31.494 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 352.905 ? 7.390 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 165.127 ? 0.832 ops/ms >> >> after ( copy32 with dead code elimination and independent addis ) >> ArrayCopyObject.conjoint_micro 31 thrpt 25 7576.346 ? 94.487 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 6475.730 ? 252.590 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 5221.764 ? 20.415 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 691.847 ? 1.102 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 360.269 ? 1.091 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 179.733 ? 3.012 ops/ms >> >> on hifive unmatched: >> >> before: >> Benchmark (size) Mode Cnt Score Error Units >> ArrayCopyObject.conjoint_micro 31 thrpt 25 5391.575 ? 152.984 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 3700.946 ? 43.175 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 2316.160 ? 24.734 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 188.616 ? 0.151 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 95.323 ? 0.053 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 46.935 ? 0.041 ops/ms >> >> after: >> Benchmark (size) Mode Cnt Score Error Units >> ArrayCopyObject.conjoint_micro 31 thrpt 25 6136.169 ? 330.409 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 4924.020 ? 78.529 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 3732.561 ? 89.606 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 431.103 ? 0.505 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 221.543 ? 0.363 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 100.586 ? 0.197 ops/ms > > Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: > > remove excessive comments linux -x86 failure is infra issue and unrelated. ------------- PR: https://git.openjdk.org/jdk/pull/11058 From jbhateja at openjdk.org Wed Nov 9 15:59:27 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 9 Nov 2022 15:59:27 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v9] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 23:21:58 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 
86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > fix 32-bit build src/hotspot/cpu/x86/vm_version_x86.cpp line 1181: > 1179: #ifdef _LP64 > 1180: if (supports_avx512ifma() & supports_avx512vlbw()) { > 1181: if (FLAG_IS_DEFAULT(UsePolyIntrinsics)) { MaxVectorSize > 32 can be added along with feature checks your code mainly uses ZMMs ------------- PR: https://git.openjdk.org/jdk/pull/10582 From mdoerr at openjdk.org Wed Nov 9 17:02:42 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 9 Nov 2022 17:02:42 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v9] In-Reply-To: References: Message-ID: <_bFejTor9CgadUAiFQ6k1RSQp3wqwmS1OjScTH3SuXA=.14d25934-cb17-49a6-94f3-633d5d1db1c0@github.com> > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Don't call IsUnloadingBehaviour for methods with permanant class loader. It may crash when the nmethod is in NonNmethod space. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/525d9a81..705657d0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=07-08 Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Wed Nov 9 17:09:21 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 9 Nov 2022 17:09:21 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v6] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 00:08:57 GMT, Erik ?sterlund wrote: >> Does this also mean that nmethods for method handle intrinsics could previously get unloaded when they became "cold", and now if they are in NonNMethod they live forever? Someone could probably write a test that fills up NonNMethod with method handle intrinsics for various signatures by spinning up temporary classes with random signatures. > > IIRC the is_cold check returns false for method handle intrinsics. Mostly because it mimicked the sweeper heuristics for coldness. 
Having said that, the sweeper heuristics couldn't sample method handle intrinsics because they had no activation records and hence would always appear as cold even when they were not, while in the nmethod entry barrier heuristics for coldness, we would be able to to phase out cold method handle intrinsic nmethods. But I expect the amount of memory consumed by them to be rather small, so the magnitude of the win of going in that direction is questionable. We were able to hit the `IsUnloadingBehaviour::is_unloading` problem! I'm testing a workaround, now. Note that the Methods are also permanently kept in `_invoke_method_intrinsic_table`. Getting rid of method handle intrinsics is currently not supported. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From mdoerr at openjdk.org Wed Nov 9 17:16:36 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 9 Nov 2022 17:16:36 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v8] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 19:34:43 GMT, Dean Long wrote: > It sounds like using nmethods for compiled method handle intrinsics isn't necessary and causes problems. What if we store them as MethodHandlesAdapterBlobs like we do for the interpreter? Lookup would need to go through method->adapter() instead of method->code(). Is there already an RFE filed for this? This may be a better approach. Generation should probably get moved out of `generate_native_wrapper`. I'll take a closer look when I find more time. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From kvn at openjdk.org Wed Nov 9 17:42:41 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 9 Nov 2022 17:42:41 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 12:41:59 GMT, Dmitry Samersoff wrote: > In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. > > The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. src/hotspot/cpu/x86/nativeInst_x86.cpp line 514: > 512: // complete jump instruction (to be inserted) is in code_buffer; > 513: #ifdef AMD64 > 514: unsigned char code_buffer[8]; Should we align this buffer too (to 8/jlong)? src/hotspot/cpu/x86/nativeInst_x86.cpp line 532: > 530: > 531: #else > 532: unsigned char code_buffer[5]; Should this be aligned? src/hotspot/cpu/x86/nativeInst_x86.cpp line 562: > 560: > 561: // Patch bytes 0-3 (from jump instruction) > 562: *(int32_t*)verified_entry = *(int32_t *)code_buffer; Is this store and at line 552 atomic? 
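One way to make the alignment explicit rather than accidental (a sketch of the suggestion, not the patch itself) is to let the type provide it:

// An 8-byte union is naturally 8-byte aligned on x86_64, and the word member
// doubles as the value for a single 64-bit store.
union PatchBuffer {
  uint64_t      word;
  unsigned char bytes[8];
};
static_assert(alignof(PatchBuffer) == 8, "patch buffer must be 8-byte aligned");

That sidesteps the question of whether a local unsigned char[8] happens to land on an 8-byte boundary, and the final copy to the entry point can then be one aligned 64-bit store of the word member instead of separate 4-byte stores.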
------------- PR: https://git.openjdk.org/jdk/pull/11059 From duke at openjdk.org Wed Nov 9 17:53:21 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 9 Nov 2022 17:53:21 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v9] In-Reply-To: References: Message-ID: <68kLqa_FyCKuyBqSpXlq_1QnR8I-HFL1aFr4uQ6DyoM=.817059a2-8945-44e7-9ce8-a5f5ee5a50c5@github.com> On Wed, 9 Nov 2022 00:10:48 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> fix 32-bit build > > src/hotspot/share/opto/library_call.cpp line 7014: > >> 7012: const TypeKlassPtr* rklass = TypeKlassPtr::make(instklass_ImmutableElement); >> 7013: const TypeOopPtr* rtype = rklass->as_instance_type()->cast_to_ptr_type(TypePtr::NotNull); >> 7014: Node* rObj = new CheckCastPPNode(control(), rFace, rtype); > > FTR it's an unsafe cast since it doesn't involve a runtime check from `IntegerModuloP` to `ImmutableElement`. Please, lift as much checks into Java wrapper as possible. @iwanowww just to save some of your time... Sandhya suggested another way to move the checks to Java, hopefully without too much penalty to the non-intrinsic path. Should upload it later today. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From kvn at openjdk.org Wed Nov 9 18:33:03 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 9 Nov 2022 18:33:03 GMT Subject: RFR: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline Message-ID: When we use -Xcomp we compile `java.lang.invoke.LambdaForm$Kind::<clinit>`, a very long linear method for an enum class: [LambdaForm.java#L250](https://github.com/openjdk/jdk/blame/master/src/java.base/share/classes/java/lang/invoke/LambdaForm.java#L250) In addition, we inline all class initializers for EA when running with -Xcomp: [bytecodeInfo.cpp#L410](https://github.com/openjdk/jdk/blame/master/src/hotspot/share/opto/bytecodeInfo.cpp#L410) The recent CDS change [JDK-8293979](https://bugs.openjdk.org/browse/JDK-8293979) allows inlining a bit more deeply too. Adding `-XX:+AlwaysIncrementalInline` worsened the situation even more. Running with `-XX:+LogCompilation` shows that we hit `NodeCountInliningCutoff (18000)` during the `java.lang.invoke.LambdaForm$Kind::<clinit>` compilation. In short, we have a very long (>40000 live nodes) linear IR graph. The `Node::verify_edges()` method processes nodes depth-first, starting from the first input, which is the control edge. So it is no surprise that the recursion depth of this method reached 6000. With a frame size of 10 words (320 bytes) we easily hit a stack overflow (768K in the 32-bit debug VM). I fixed it by using a local `Node_List` buffer instead of recursion in `Node::verify_edges()`. The algorithm was changed to simplify the code. We process inputs in reverse order - the last input is processed first. And I noticed that the maximum use of the buffer is only about 1000 elements or fewer for this compilation (that is why I use live_nodes/16 as the initial size of the buffer). Then I did an additional experiment with keeping the recursion but also processing inputs in reverse order: // Recursive walk over all input edges - for( i = 0; i < len(); i++ ) { - n = in(i); + for( i = len(); i > 0; i-- ) { + n = in(i - 1); if( n != NULL ) n->verify_edges(visited); } And it showed the same stack depth of around 1000! I decided to keep my original fix because it should be faster (it puts only one value on the list instead of putting all locals, PC, and SP on the stack for each call) and uses much less stack.
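A rough sketch of that worklist shape (the names follow the description above, not the exact patch):

// Iterative replacement for the recursive walk: the traversal depth no longer
// depends on how long the linear chain of nodes is.
void verify_edges_iteratively(Node* root, Unique_Node_List& visited) {
  Node_List worklist;                  // sized at about live_nodes/16 in the patch
  worklist.push(root);
  while (worklist.size() > 0) {
    Node* n = worklist.pop();
    if (visited.member(n)) {
      continue;
    }
    visited.push(n);
    // ... the per-node checks of in/out edge correspondence stay here ...
    for (uint i = n->len(); i > 0; i--) {   // last input ends up processed first
      Node* in = n->in(i - 1);
      if (in != nullptr) {
        worklist.push(in);
      }
    }
  }
}

Each pending node costs one Node* slot on the list instead of a full C++ frame (locals, PC, SP), which is where both the depth and the footprint win come from.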
Testing tier1-3, hs-comp-stress and `TestVerifyGraphEdges.java` test runs with `-XX:+AlwaysIncrementalInline`. ------------- Commit messages: - remove white space - Merge branch 'master' into JDK-8295867 - 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline Changes: https://git.openjdk.org/jdk/pull/11065/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11065&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295867 Stats: 60 lines in 3 files changed: 23 ins; 13 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/11065.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11065/head:pull/11065 PR: https://git.openjdk.org/jdk/pull/11065 From vlivanov at openjdk.org Wed Nov 9 20:33:29 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 9 Nov 2022 20:33:29 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Wed, 9 Nov 2022 11:57:45 GMT, Aleksey Shipilev wrote: >> If you look at generated code for the JMH benchmark like: >> >> >> public class ArrayRead { >> @Param({"1", "100", "10000", "1000000"}) >> int size; >> >> int[] is; >> >> @Setup >> public void setup() { >> is = new int[size]; >> for (int c = 0; c < size; c++) { >> is[c] = c; >> } >> } >> >> @Benchmark >> public void test(Blackhole bh) { >> for (int i = 0; i < is.length; i++) { >> bh.consume(is[i]); >> } >> } >> } >> >> >> ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. >> >> This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. >> >> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. >> >> Motivational improvements on the test above: >> >> >> Benchmark (size) Mode Cnt Score Error Units >> >> # Before, full Java blackholes >> ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op >> ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op >> ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op >> ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op >> >> # Before, compiler blackholes >> ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op >> ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op >> ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op >> ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op >> >> # After, compiler blackholes >> ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better >> ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better >> ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better >> ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better >> >> >> `-prof perfasm` shows the reason for these improvements clearly: >> >> Before: >> >> >> ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 >> 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d >> 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f >> ? 
0x00007f0b5449836a: shl $0x3,%r10 >> 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" >> 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" >> 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 >> 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ >> 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 >> 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check >> 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d >> 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 >> >> >> After: >> >> >> >> ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read >> 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx >> 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx >> 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx >> 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx >> 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx >> 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 >> ? 0x00007fa06c49a8dc: cmp %esi,%r10d >> 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 >> >> >> Additional testing: >> - [x] Eyeballing JMH Samples `-prof perfasm` >> - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` >> - [x] Linux x86_64 fastdebug, JDK benchmark corpus > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Do not touch memory at all src/hotspot/share/opto/library_call.cpp line 7784: > 7782: // side effects like breaking the optimizations across the blackhole. > 7783: > 7784: MemBarNode* mb = MemBarNode::make(C, Op_Blackhole); One thing to clear if you decide to keep modeling it as `MemBar`: pass `AliasIdxTop` as `alias_idx` . ------------- PR: https://git.openjdk.org/jdk/pull/11041 From vlivanov at openjdk.org Wed Nov 9 20:33:30 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 9 Nov 2022 20:33:30 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Wed, 9 Nov 2022 14:32:13 GMT, Aleksey Shipilev wrote: > can probably even be Node, if I understand how should control-output-only nodes be defined Yes, `MultiNode` is specifically for cases when a node produces multiple results. FTR I looked through the code base for `is_MemBar` usages and didn't spot any problematic cases, if you clear away memory input. But I'm fine with keeping it a membar. ------------- PR: https://git.openjdk.org/jdk/pull/11041 From duke at openjdk.org Wed Nov 9 21:48:59 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 9 Nov 2022 21:48:59 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v10] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. 
> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: add getLimbs to interface and reviews ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/da560452..8b1b40f7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=08-09 Stats: 235 lines in 11 files changed: 103 ins; 79 del; 53 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 9 21:49:10 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 9 Nov 2022 21:49:10 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v9] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 15:55:53 GMT, Jatin Bhateja wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> fix 32-bit build > > src/hotspot/cpu/x86/vm_version_x86.cpp line 1181: > >> 1179: #ifdef _LP64 >> 1180: if (supports_avx512ifma() & supports_avx512vlbw()) { >> 1181: if (FLAG_IS_DEFAULT(UsePolyIntrinsics)) { > > MaxVectorSize > 32 can be added along with feature checks your code mainly uses ZMMs done (`MaxVectorSize >= 64`) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 9 21:49:09 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 9 Nov 2022 21:49:09 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v9] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 00:23:21 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> fix 32-bit build > > src/hotspot/cpu/x86/macroAssembler_x86.hpp line 970: > >> 968: >> 969: void addmq(int disp, Register r1, Register r2); >> 970: > > Leftover formatting changes. done > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 95: > >> 93: >> 94: // OFFSET 64: mask_44 >> 95: 0xfffffffffff, 0xfffffffffff, > > Please, keep leading zeroes explicit in the constants. done. Also split things up and added ExternalAddress version of instructions. > src/hotspot/cpu/x86/stubRoutines_x86.cpp line 2: > >> 1: /* >> 2: * Copyright (c) 2013, 2022, Oracle and/or its affiliates. All rights reserved. 
> > No changes in the file anymore. done > src/hotspot/share/opto/library_call.cpp line 7014: > >> 7012: const TypeKlassPtr* rklass = TypeKlassPtr::make(instklass_ImmutableElement); >> 7013: const TypeOopPtr* rtype = rklass->as_instance_type()->cast_to_ptr_type(TypePtr::NotNull); >> 7014: Node* rObj = new CheckCastPPNode(control(), rFace, rtype); > > FTR it's an unsafe cast since it doesn't involve a runtime check from `IntegerModuloP` to `ImmutableElement`. Please, lift as much checks into Java wrapper as possible. @iwanowww Please have a look, just pushed a different way to fetch the limbs. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 9 21:57:46 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 9 Nov 2022 21:57:46 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v9] In-Reply-To: References: Message-ID: <9VOSld7kTyK9X5jTVkY2Dm_7CVOdZlHzOcoXSF8iLG4=.0fb02250-253e-4c1a-9c4e-b7e147c3e2b2@github.com> On Tue, 8 Nov 2022 23:59:42 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> fix 32-bit build > > src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 175: > >> 173: >> 174: int blockMultipleLength = len & (~(BLOCK_LENGTH-1)); >> 175: Objects.checkFromIndexSize(offset, blockMultipleLength, input.length); > > I suggest to move the checks into `processMultipleBlocks`, introduce new static helper method specifically for the intrinsic part, and lift more logic (e.g., field loads) from the intrinsic into Java code. > > As an additional step, you can switch to double-register addressing mode (base + offset) for input data (`input`, `alimbs`, `rlimbs`) and simplify the intrinsic part even more (will involve a switch from `array_element_address` to `make_unsafe_address`). `array_element_address` vs `make_unsafe_address`. Don't know that I understood.. but going to guess :) "It might be cleaner to encode base+offset into the instruction opcode, save some `lea`s" I think that ship has 'sailed'? - `input`: I went and removed `offset` from intrinsic stub parameter list and instead passed it to `array_element_address`. But also, because I was really running out of GPRs, I had to do a `lea` before that at the function entry. Can't keep the offset register free for encoding.. - `alimbs`: offset already 0. Also, I mostly keep the actual value `a2:a1:a0` around. Just need address to write result back out. - `rlimbs`: offset already 0 and address itself discarded right after loading the R value into 2 GPRs. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 9 22:00:40 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 9 Nov 2022 22:00:40 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v6] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 02:19:29 GMT, Volodymyr Paprotski wrote: >>> Did not split it up into individual constants. The main 'problem' is that Address and ExternalAddress are not compatible. >> >> There's a reason for that and it's because RIP-relative addressing doesn't always work, so additional register may be needed. >> >>> Most instructions do not take AddressLiteral, so can't use ExternalAddress to refer to those constants. >> >> I counted 4 instructions accessing the constants (`evpandq`, `andq`, `evporq`, and `vpternlogq`) in your patch. 
>> >> `macroAssembler_x86.hpp` is the place for `AddressLiteral`-related overloads (there are already numerous cases present) and it's trivial to add new ones. >> >>> (If I did get the instructions I use to take AddressLiteral, I think we would end up with more lea(rscratch)s generated; but that's more of a silver-lining) >> >> It depends on memory layout. If constants end up placed close enough in the address space, there'll be no additional instructions generated. >> >> Anyway, it doesn't look like something important from throughput perspective. Overall, I find it clearer when the code refers to individual constants through `AddressLiteral`s, but I'm also fine with it as it is now. > > Makes sense to me, that would indeed be cleaner, will add a couple more overloads. (Still getting used to what is 'clean' in this code base). done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Thu Nov 10 00:17:11 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 10 Nov 2022 00:17:11 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v3] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 14:47:41 GMT, Roland Westrelin wrote: >> This change is mostly the same I sent for review 3 years ago but was >> never integrated: >> >> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html >> >> The main difference is that, in the meantime, I submitted a couple of >> refactoring changes extracted from the 2019 patch: >> >> 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr >> 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses >> >> As a result, the current patch is much smaller (but still not small). >> >> The implementation is otherwise largely the same as in the 2019 >> patch. I tried to remove some of the code duplication between the >> TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic >> shared in template methods. In the 2019 patch, interfaces were trusted >> when types were constructed and I had added code to drop interfaces >> from a type where they couldn't be trusted. This new patch proceeds >> the other way around: interfaces are not trusted when a type is >> constructed and code that uses the type must explicitly request that >> they are included (this was suggested as an improvement by Vladimir >> Ivanov I think). > > Roland Westrelin has updated the pull request incrementally with five additional commits since the last revision: > > - review > - review > - review > - review > - review Much better, thanks! Minor comments/suggestions follow. FTR test results are clean. I'll submit performance testing shortly. src/hotspot/share/ci/ciArrayKlass.cpp line 108: > 106: } > 107: > 108: GrowableArray* ciArrayKlass::interfaces() { FTR there's a subtle asymmetry between `ciArrayKlass::interfaces()` and `ciInstanceKlass::transitive_interfaces()` when it comes to memory allocation: the former allocates from resource area while the latter from compiler arena. It doesn't cause any problems since `ciArrayKlass::interfaces()` is used only in `Type::Initialize_shared()` to instantiate shared `TypeAryPtr::_array_interfaces`, but it took me some time to find that out. Maybe a helper method in `type.cpp` is a better place for that logic. 
src/hotspot/share/ci/ciInstanceKlass.cpp line 736: > 734: if (_transitive_interfaces == NULL) { > 735: GUARDED_VM_ENTRY( > 736: InstanceKlass* ik = get_instanceKlass(); A candidate for `compute_transitive_interfaces()` helper method? src/hotspot/share/opto/subnode.cpp line 1050: > 1048: // return the ConP(Foo.klass) > 1049: assert(mirror_type->is_klass(), "mirror_type should represent a Klass*"); > 1050: return phase->makecon(TypeKlassPtr::make(mirror_type->as_klass(), Type::trust_interfaces)); Extra space. src/hotspot/share/opto/type.cpp line 4003: > 4001: assert(loaded->ptr() != TypePtr::Null, "insanity check"); > 4002: // > 4003: if( loaded->ptr() == TypePtr::TopPTR ) { return unloaded; } Missing space after `if`. src/hotspot/share/opto/type.cpp line 4017: > 4015: // Both are unloaded, not the same class, not Object > 4016: // Or meet unloaded with a different loaded class, not java/lang/Object > 4017: if( ptr != TypePtr::BotPTR ) { Missing space after `if`. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From vlivanov at openjdk.org Thu Nov 10 00:17:13 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 10 Nov 2022 00:17:13 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v2] In-Reply-To: References: <01z8uzrzUhUPTxFfIqlLbrZYjtn9xWd3Hth6rRWtbZg=.7792c2ee-f92d-4a2e-a605-973d4fd85816@github.com> Message-ID: On Tue, 8 Nov 2022 15:30:08 GMT, Roland Westrelin wrote: >> src/hotspot/share/ci/ciObjectFactory.cpp line 160: >> >>> 158: InstanceKlass* ik = vmClasses::name(); \ >>> 159: ciEnv::_##name = get_metadata(ik)->as_instance_klass(); \ >>> 160: Array* interfaces = ik->transitive_interfaces(); \ >> >> What's the purpose of interface-related part of the code? > > ciInstanceKlass objects for the vm classes all need to be allocated from the same long lived arena that's created in ciObjectFactory::initialize() because they are shared between compilations. Without that code, the ciInstanceKlass for a particular vm class is in the long lived arena but the ciInstanceKlass objects for the interfaces are created later when they are needed in the arena of some compilation. Once that compilation is over the interface objects are destroyed but still referenced from shared types such as TypeInstPtr::MIRROR. Thanks, got it now. Is it really needed considering there's `transitive_interfaces()` call later? ------------- PR: https://git.openjdk.org/jdk/pull/10901 From duke at openjdk.org Thu Nov 10 01:22:04 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 10 Nov 2022 01:22:04 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v11] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 
86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: fix windows and 32b linux builds ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/8b1b40f7..abfc68f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=09-10 Stats: 5 lines in 3 files changed: 4 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From dzhang at openjdk.org Thu Nov 10 01:51:52 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Nov 2022 01:51:52 GMT Subject: RFR: 8296638: RISC-V: Add T_SHORT/T_BYTE support for NegVI Message-ID: <5c6-E5vE3iSau2tWJWNe-2YX6DebdqMblUQe31fipQs=.764e9122-82ef-49c5-99a4-6283de54392c@github.com> Hi, NegVI also matches nodes with vector elements of T_SHORT/T_BYTE type, so it is necessary to add support for these two basic types in the instruct `vnegI`. Meanwhile, I removed some useless trailing whitespace. Please take a look and have some reviews. Thanks a lot. ## Testing: - test/jdk/jdk/incubator/vector/* with fastdebug/release on qemu ------------- Commit messages: - remove some useless trailing whitespace - Add T_SHORT/T_BYTE support for NegVI Changes: https://git.openjdk.org/jdk/pull/11074/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11074&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296638 Stats: 12 lines in 1 file changed: 5 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/11074.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11074/head:pull/11074 PR: https://git.openjdk.org/jdk/pull/11074 From dzhang at openjdk.org Thu Nov 10 02:25:23 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Nov 2022 02:25:23 GMT Subject: RFR: 8296638: RISC-V: Add T_SHORT/T_BYTE support for NegVI [v2] In-Reply-To: <5c6-E5vE3iSau2tWJWNe-2YX6DebdqMblUQe31fipQs=.764e9122-82ef-49c5-99a4-6283de54392c@github.com> References: <5c6-E5vE3iSau2tWJWNe-2YX6DebdqMblUQe31fipQs=.764e9122-82ef-49c5-99a4-6283de54392c@github.com> Message-ID: > Hi, > > NegVI also matches nodes with vector elements of T_SHORT/T_BYTE type, so it is necessary to add support for these two basic types in the instruct `vnegI`. > > These two tests are currently failing: > test/jdk/jdk/incubator/vector/Byte256VectorTests.java > test/jdk/jdk/incubator/vector/Short256VectorTests.java > > Meanwhile, I removed some useless trailing whitespace. > > Please take a look and have some reviews. Thanks a lot. 
> > ## Testing: > > - test/jdk/jdk/incubator/vector/* with fastdebug/release on qemu Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Remove useless predicate ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11074/files - new: https://git.openjdk.org/jdk/pull/11074/files/378fe8db..a5a04735 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11074&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11074&range=00-01 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11074.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11074/head:pull/11074 PR: https://git.openjdk.org/jdk/pull/11074 From dlong at openjdk.org Thu Nov 10 02:41:33 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 10 Nov 2022 02:41:33 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 17:19:55 GMT, Vladimir Kozlov wrote: >> In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. >> >> The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. > > src/hotspot/cpu/x86/nativeInst_x86.cpp line 514: > >> 512: // complete jump instruction (to be inserted) is in code_buffer; >> 513: #ifdef AMD64 >> 514: unsigned char code_buffer[8]; > > Should we align this buffer too (to 8/jlong)? I suggest using a union. ------------- PR: https://git.openjdk.org/jdk/pull/11059 From gcao at openjdk.org Thu Nov 10 02:50:31 2022 From: gcao at openjdk.org (Gui Cao) Date: Thu, 10 Nov 2022 02:50:31 GMT Subject: RFR: 8296638: RISC-V: NegVI node emits wrong code when vector element basic type is T_BYTE/T_SHORT [v2] In-Reply-To: References: <5c6-E5vE3iSau2tWJWNe-2YX6DebdqMblUQe31fipQs=.764e9122-82ef-49c5-99a4-6283de54392c@github.com> Message-ID: On Thu, 10 Nov 2022 02:25:23 GMT, Dingli Zhang wrote: >> Hi, >> >> test/jdk/jdk/incubator/vector/Byte256VectorTests.java fails on riscv with the following error: >> >> test Byte256VectorTests.negByte256VectorTests (byte [i * 5]): failure >> java.lang.AssertionError: at index #2, input = 10 expected [-10] but found [-11] >> >> >> Currently, `NegVI` can only handle the vector element basic type `T_INT` with`vsetvli(t0, x0, Assembler::e32)` but `T_SHORT/T_BYTE` can also be matched with `NegVI`, so these two types of tests are currently failing: >> >> test/jdk/jdk/incubator/vector/Byte*VectorTests.java >> test/jdk/jdk/incubator/vector/Short*VectorTests.java >> >> >> Meanwhile, I removed some useless trailing whitespace. >> >> Please take a look and have some reviews. Thanks a lot. >> >> ## Testing: >> >> - test/jdk/jdk/incubator/vector (fastdebug/release with UseRVV on QEMU) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Remove useless predicate LGTM, Thanks! ------------- Marked as reviewed by gcao (Author). 
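A minimal sketch of the shape such a fix can take in the `vnegI` encoding block: the SEW passed to `vsetvli` is derived from the vector element basic type instead of being hard-wired to `Assembler::e32`. The names below follow the usual riscv `.ad` conventions but are assumptions for illustration, not the actual patch:

    BasicType bt = Matcher::vector_element_basic_type(this);
    switch (bt) {
      case T_BYTE:  __ vsetvli(t0, x0, Assembler::e8);  break;
      case T_SHORT: __ vsetvli(t0, x0, Assembler::e16); break;
      case T_INT:   __ vsetvli(t0, x0, Assembler::e32); break;
      default:      ShouldNotReachHere();
    }
    // vneg.v vd, vs is the vrsub.vx pseudo-instruction with x0 as the scalar operand
    __ vrsub_vx(as_VectorRegister($dst$$reg), as_VectorRegister($src$$reg), x0);

With the element width picked per basic type, the same match rule can serve T_BYTE, T_SHORT and T_INT without doing e32 arithmetic on subword vectors.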
PR: https://git.openjdk.org/jdk/pull/11074 From duke at openjdk.org Thu Nov 10 03:09:43 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 10 Nov 2022 03:09:43 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v11] In-Reply-To: References: Message-ID: <9Ststt1zBbU04qp9Ilb7zPQx3bA5uIQEi-TtbpiMn1s=.01700387-0a3b-4f47-9daa-1febc7230539@github.com> On Thu, 10 Nov 2022 01:22:04 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > fix windows and 32b linux builds Revised numbers with `getLimbs()` interface change. Compared to previous version that got limbs in IR, change is within deviation.. (mostly -1%) datasize | master | optimized | disabled | opt/mst | dis/mst -- | -- | -- | -- | -- | -- 32 | 3218169 | 3651078 | 3116558 | 1.13 | 0.97 64 | 2858030 | 3407518 | 2824903 | 1.19 | 0.99 128 | 2396796 | 3357224 | 2394802 | 1.40 | 1.00 256 | 1780679 | 3050142 | 1751130 | 1.71 | 0.98 512 | 1168824 | 2938952 | 1148479 | 2.51 | 0.98 1024 | 648772.1 | 2728454 | 687016.7 | 4.21 | 1.06 2048 | 357009 | 2393507 | 392928.2 | 6.70 | 1.10 16384 | 48854.33 | 903175.4 | 52874.78 | 18.49 | 1.08 1048576 | 771.461 | 14951.24 | 840.792 | 19.38 | 1.09 ------------- PR: https://git.openjdk.org/jdk/pull/10582 From fyang at openjdk.org Thu Nov 10 03:46:31 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 10 Nov 2022 03:46:31 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: References: Message-ID: On Sun, 6 Nov 2022 17:28:53 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). >> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. 
It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. >> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Fix cpp condition and add PPC64 > - Changes lost in merge > - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 > - Use callers_sp for fsize calculation in recurse_freeze_interpreted_frame > - Loom ppc64le port Hi, I also tested this on my Linux RISC-V platform. Test results look good. Thanks. 
- Tier1-4 tests - Tests under test/jdk/java/lang/Thread/virtual with extra options "-XX:+VerifyContinuations -XX:+VerifyStack" ------------- PR: https://git.openjdk.org/jdk/pull/10961 From fyang at openjdk.org Thu Nov 10 04:08:27 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 10 Nov 2022 04:08:27 GMT Subject: RFR: 8296638: RISC-V: NegVI node emits wrong code when vector element basic type is T_BYTE/T_SHORT [v2] In-Reply-To: References: <5c6-E5vE3iSau2tWJWNe-2YX6DebdqMblUQe31fipQs=.764e9122-82ef-49c5-99a4-6283de54392c@github.com> Message-ID: On Thu, 10 Nov 2022 02:25:23 GMT, Dingli Zhang wrote: >> Hi, >> >> test/jdk/jdk/incubator/vector/Byte256VectorTests.java fails on riscv with the following error: >> >> test Byte256VectorTests.negByte256VectorTests (byte [i * 5]): failure >> java.lang.AssertionError: at index #2, input = 10 expected [-10] but found [-11] >> >> >> Currently, `NegVI` can only handle the vector element basic type `T_INT` with`vsetvli(t0, x0, Assembler::e32)` but `T_SHORT/T_BYTE` can also be matched with `NegVI`, so these two types of tests are currently failing: >> >> test/jdk/jdk/incubator/vector/Byte*VectorTests.java >> test/jdk/jdk/incubator/vector/Short*VectorTests.java >> >> >> Meanwhile, I removed some useless trailing whitespace. >> >> Please take a look and have some reviews. Thanks a lot. >> >> ## Testing: >> >> - test/jdk/jdk/incubator/vector (fastdebug/release with UseRVV on QEMU) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Remove useless predicate Looks reasonable. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/11074 From dzhang at openjdk.org Thu Nov 10 04:47:26 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Nov 2022 04:47:26 GMT Subject: RFR: 8296638: RISC-V: NegVI node emits wrong code when vector element basic type is T_BYTE/T_SHORT [v2] In-Reply-To: References: <5c6-E5vE3iSau2tWJWNe-2YX6DebdqMblUQe31fipQs=.764e9122-82ef-49c5-99a4-6283de54392c@github.com> Message-ID: On Thu, 10 Nov 2022 02:47:58 GMT, Gui Cao wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove useless predicate > > LGTM, Thanks! @zifeihan @RealFYang Thanks for the review! ------------- PR: https://git.openjdk.org/jdk/pull/11074 From chagedorn at openjdk.org Thu Nov 10 07:23:32 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 10 Nov 2022 07:23:32 GMT Subject: RFR: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 18:14:43 GMT, Vladimir Kozlov wrote: > When we use -Xcomp we compile `java.lang.invoke.LambdaForm$Kind::` very long linear method for enum class: [LambdaForm.java#L250](https://github.com/openjdk/jdk/blame/master/src/java.base/share/classes/java/lang/invoke/LambdaForm.java#L250) > > In addition we inline all class initializers for EA when run with -Xcomp: [bytecodeInfo.cpp#L410](https://github.com/openjdk/jdk/blame/master/src/hotspot/share/opto/bytecodeInfo.cpp#L410) > > Recent CDS change [JDK-8293979](https://bugs.openjdk.org/browse/JDK-8293979) allows to inline a bit more deeply too. > > Adding `-XX:+AlwaysIncrementalInline` worsened situation even more. > > Running with `-XX:+LogCompilation` shows that we hit `NodeCountInliningCutoff (18000)` during `java.lang.invoke.LambdaForm$Kind::` compilation. 
> > In short, we have very long (>40000 live nodes) linear IR graph. `Node::verify_edges()` method process nodes depth-first starting from first input which is control edge. So it is not surprise that depth of this method recursion reached 6000. > With frame size of 10 words (320 bytes) we easy hit stack overflow (768K in 32-bits debug VM). > > I fixed it by using local buffer `Node_List` instead of recursion in `Node::verify_edges()`. > The algorithm was changed to simplify code. It processes inputs in reverse order - last input processed first. And I noticed that maximum use of buffer is only about 1000 or less elements for this compilation (that is why I use live_nodes/16 as initial size of buffer). > > Then I did additional experiment with keeping recursion but processing inputs in reverse order: > > > // Recursive walk over all input edges > - for( i = 0; i < len(); i++ ) { > - n = in(i); > + for( i = len(); i > 0; i++ ) { > + n = in(i - 1); > if( n != NULL ) > in(i)->verify_edges(visited); > } > > > And it shows the same around 1000 stack depth! > > I decided to keep my original fix because it should be faster (put only one value on list instead of putting all locals, PC, SP on stack and calls) and much less stack usage. > > Testing tier1-3, hs-comp-stress and `TestVerifyGraphEdges.java` test runs with `-XX:+AlwaysIncrementalInline`. I agree that a non-recursive solution is preferable in this case. I only have some minor code style comments - otherwise, the fix looks good! src/hotspot/share/opto/compile.cpp line 4243: > 4241: // Allocate stack of size C->live_nodes()/16 to avoid frequent realloc > 4242: uint stack_size = live_nodes() >> 4; > 4243: Node_List nstack(MAX2(stack_size, (uint)OptoNodeListSize)); As you only need the stack in `verify_edges()`, I suggest to move these lines directly into the method `verify_edges()`. src/hotspot/share/opto/node.cpp line 2699: > 2697: uint length = next->len(); > 2698: for (uint i = 0; i < length; i++) { > 2699: Node* n = next->in(i); I suggest to rename `n` to `input` to make it easier to see if it is the current node or an input to it. src/hotspot/share/opto/node.cpp line 2714: > 2712: // Check for duplicate edges > 2713: // walk the input array downcounting the input edges to n > 2714: for(uint j = 0; j < length; j++) { Suggestion: for (uint j = 0; j < length; j++) { src/hotspot/share/opto/node.hpp line 1220: > 1218: virtual void dump_compact_spec(outputStream *st) const { dump_spec(st); } > 1219: > 1220: static void verify_edges(Node* root, Unique_Node_List &visited, Node_List &nstack); // Verify bi-directional edges Maybe you can directly rename the method according to your comment: `verify_bidirectional_edges()`. ------------- Marked as reviewed by chagedorn (Reviewer). 
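A minimal sketch of the iterative, worklist-based verification described above, assembled from the fragments quoted in this review (the renamed method, the `Node_List` buffer sized from `live_nodes()/16`, and the reverse-edge check); the exact checks and helper names are assumptions, not the actual patch:

    void Compile::verify_bidirectional_edges(Unique_Node_List& visited) {
      // Local buffer instead of recursion; sized to avoid frequent realloc.
      uint stack_size = live_nodes() >> 4;
      Node_List nstack(MAX2(stack_size, (uint)OptoNodeListSize));
      nstack.push(root());
      visited.push(root());

      while (nstack.size() > 0) {
        Node* next = nstack.pop();
        uint length = next->len();
        for (uint i = 0; i < length; i++) {
          Node* in = next->in(i);
          if (in == NULL) {
            continue;
          }
          // Count how often 'in' occurs among next's inputs ...
          uint cnt = 0;
          for (uint j = 0; j < length; j++) {
            if (next->in(j) == in) {
              cnt++;
            }
          }
          // ... and expect the same number of def-use edges from 'in' back to 'next'.
          uint rev = 0;
          for (DUIterator_Fast kmax, k = in->fast_outs(kmax); k < kmax; k++) {
            if (in->fast_out(k) == next) {
              rev++;
            }
          }
          assert(cnt == rev, "mismatched bidirectional edges");
          // Unvisited inputs go on the local buffer instead of a recursive call.
          if (!visited.member(in)) {
            visited.push(in);
            nstack.push(in);
          }
        }
      }
    }

Each node is pushed at most once, so the buffer usage is bounded by the number of distinct unvisited inputs discovered so far rather than by the recursion depth of the longest input chain.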
PR: https://git.openjdk.org/jdk/pull/11065 From shade at openjdk.org Thu Nov 10 08:38:35 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Nov 2022 08:38:35 GMT Subject: RFR: 8296638: RISC-V: NegVI node emits wrong code when vector element basic type is T_BYTE/T_SHORT [v2] In-Reply-To: References: <5c6-E5vE3iSau2tWJWNe-2YX6DebdqMblUQe31fipQs=.764e9122-82ef-49c5-99a4-6283de54392c@github.com> Message-ID: On Thu, 10 Nov 2022 02:25:23 GMT, Dingli Zhang wrote: >> Hi, >> >> test/jdk/jdk/incubator/vector/Byte256VectorTests.java fails on riscv with the following error: >> >> test Byte256VectorTests.negByte256VectorTests (byte [i * 5]): failure >> java.lang.AssertionError: at index #2, input = 10 expected [-10] but found [-11] >> >> >> Currently, `NegVI` can only handle the vector element basic type `T_INT` with`vsetvli(t0, x0, Assembler::e32)` but `T_SHORT/T_BYTE` can also be matched with `NegVI`, so these two types of tests are currently failing: >> >> test/jdk/jdk/incubator/vector/Byte*VectorTests.java >> test/jdk/jdk/incubator/vector/Short*VectorTests.java >> >> >> Meanwhile, I removed some useless trailing whitespace. >> >> Please take a look and have some reviews. Thanks a lot. >> >> ## Testing: >> >> - test/jdk/jdk/incubator/vector (fastdebug/release with UseRVV on QEMU) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Remove useless predicate Marked as reviewed by shade (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11074 From dzhang at openjdk.org Thu Nov 10 08:41:54 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Nov 2022 08:41:54 GMT Subject: Integrated: 8296638: RISC-V: NegVI node emits wrong code when vector element basic type is T_BYTE/T_SHORT In-Reply-To: <5c6-E5vE3iSau2tWJWNe-2YX6DebdqMblUQe31fipQs=.764e9122-82ef-49c5-99a4-6283de54392c@github.com> References: <5c6-E5vE3iSau2tWJWNe-2YX6DebdqMblUQe31fipQs=.764e9122-82ef-49c5-99a4-6283de54392c@github.com> Message-ID: On Thu, 10 Nov 2022 01:43:40 GMT, Dingli Zhang wrote: > Hi, > > test/jdk/jdk/incubator/vector/Byte256VectorTests.java fails on riscv with the following error: > > test Byte256VectorTests.negByte256VectorTests (byte [i * 5]): failure > java.lang.AssertionError: at index #2, input = 10 expected [-10] but found [-11] > > > Currently, `NegVI` can only handle the vector element basic type `T_INT` with`vsetvli(t0, x0, Assembler::e32)` but `T_SHORT/T_BYTE` can also be matched with `NegVI`, so these two types of tests are currently failing: > > test/jdk/jdk/incubator/vector/Byte*VectorTests.java > test/jdk/jdk/incubator/vector/Short*VectorTests.java > > > Meanwhile, I removed some useless trailing whitespace. > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > > - test/jdk/jdk/incubator/vector (fastdebug/release with UseRVV on QEMU) This pull request has now been integrated. 
Changeset: f2acdfdc Author: Dingli Zhang Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/f2acdfdcbd2a49c1167656e73b67b38b545f9472 Stats: 9 lines in 1 file changed: 2 ins; 0 del; 7 mod 8296638: RISC-V: NegVI node emits wrong code when vector element basic type is T_BYTE/T_SHORT Reviewed-by: gcao, fyang, shade ------------- PR: https://git.openjdk.org/jdk/pull/11074 From shade at openjdk.org Thu Nov 10 08:42:31 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Nov 2022 08:42:31 GMT Subject: RFR: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 18:14:43 GMT, Vladimir Kozlov wrote: > When we use -Xcomp we compile `java.lang.invoke.LambdaForm$Kind::` very long linear method for enum class: [LambdaForm.java#L250](https://github.com/openjdk/jdk/blame/master/src/java.base/share/classes/java/lang/invoke/LambdaForm.java#L250) > > In addition we inline all class initializers for EA when run with -Xcomp: [bytecodeInfo.cpp#L410](https://github.com/openjdk/jdk/blame/master/src/hotspot/share/opto/bytecodeInfo.cpp#L410) > > Recent CDS change [JDK-8293979](https://bugs.openjdk.org/browse/JDK-8293979) allows to inline a bit more deeply too. > > Adding `-XX:+AlwaysIncrementalInline` worsened situation even more. > > Running with `-XX:+LogCompilation` shows that we hit `NodeCountInliningCutoff (18000)` during `java.lang.invoke.LambdaForm$Kind::` compilation. > > In short, we have very long (>40000 live nodes) linear IR graph. `Node::verify_edges()` method process nodes depth-first starting from first input which is control edge. So it is not surprise that depth of this method recursion reached 6000. > With frame size of 10 words (320 bytes) we easy hit stack overflow (768K in 32-bits debug VM). > > I fixed it by using local buffer `Node_List` instead of recursion in `Node::verify_edges()`. > The algorithm was changed to simplify code. It processes inputs in reverse order - last input processed first. And I noticed that maximum use of buffer is only about 1000 or less elements for this compilation (that is why I use live_nodes/16 as initial size of buffer). > > Then I did additional experiment with keeping recursion but processing inputs in reverse order: > > > // Recursive walk over all input edges > - for( i = 0; i < len(); i++ ) { > - n = in(i); > + for( i = len(); i > 0; i++ ) { > + n = in(i - 1); > if( n != NULL ) > in(i)->verify_edges(visited); > } > > > And it shows the same around 1000 stack depth! > > I decided to keep my original fix because it should be faster (put only one value on list instead of putting all locals, PC, SP on stack and calls) and much less stack usage. > > Testing tier1-3, hs-comp-stress and `TestVerifyGraphEdges.java` test runs with `-XX:+AlwaysIncrementalInline`. x86_32 seems to be happy with this change. I ran both `TestVerifyGraphEdges` and `tier1 tier2` with `-XX:+VerifyGraphEdges` without problems with Linux x86_32 fastdebug. The code looks reasonable too. ------------- Marked as reviewed by shade (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11065 From roland at openjdk.org Thu Nov 10 08:43:23 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 10 Nov 2022 08:43:23 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v3] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 00:06:40 GMT, Vladimir Ivanov wrote: >> Roland Westrelin has updated the pull request incrementally with five additional commits since the last revision: >> >> - review >> - review >> - review >> - review >> - review > > src/hotspot/share/ci/ciArrayKlass.cpp line 108: > >> 106: } >> 107: >> 108: GrowableArray* ciArrayKlass::interfaces() { > > FTR there's a subtle asymmetry between `ciArrayKlass::interfaces()` and `ciInstanceKlass::transitive_interfaces()` when it comes to memory allocation: the former allocates from resource area while the latter from compiler arena. > > It doesn't cause any problems since `ciArrayKlass::interfaces()` is used only in `Type::Initialize_shared()` to instantiate shared `TypeAryPtr::_array_interfaces`, but it took me some time to find that out. > > Maybe a helper method in `type.cpp` is a better place for that logic. How would that work? ciArrayKlass::interfaces() has to transition into the vm which is not something that would feel right in type.cpp. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From xlinzheng at openjdk.org Thu Nov 10 09:43:25 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Thu, 10 Nov 2022 09:43:25 GMT Subject: RFR: 8296771: RISC-V: C2: assert(false) failed: bad AD file Message-ID: The backend encounters the same assertion failure as [JDK-8295414](https://bugs.openjdk.org/browse/JDK-8295414). This patch is a similar fix. Same, `partialSubtypeCheck` uses `iRegP_R15` as a result while its use `iRegP` doesn't match the `iRegP_R15`, causing this failure. The details are in the JBS issue [JDK-8296771](https://bugs.openjdk.org/browse/JDK-8296771). Tested the failed `compiler/types/TestSubTypeCheckMacroTrichotomy.java`, and a hotspot tier1 is running now. Thanks, Xiaolin ------------- Commit messages: - Fix simply Changes: https://git.openjdk.org/jdk/pull/11085/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11085&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296771 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11085.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11085/head:pull/11085 PR: https://git.openjdk.org/jdk/pull/11085 From rrich at openjdk.org Thu Nov 10 09:49:28 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 10 Nov 2022 09:49:28 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 03:44:16 GMT, Fei Yang wrote: > Hi, I also tested this on my Linux RISC-V platform which has support for virtual threads. Test results look good. Thanks. > > * Tier1-4 tests > > * Tests under test/jdk/java/lang/Thread/virtual with extra options "-XX:+VerifyContinuations -XX:+VerifyStack" > > * Non-trivial benchmark workloads, like Renaissance, SPECjvm2008, SPECjbb2015, etc. Great. Thanks for the testing. Richard. 
------------- PR: https://git.openjdk.org/jdk/pull/10961 From roland at openjdk.org Thu Nov 10 10:01:48 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 10 Nov 2022 10:01:48 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v4] In-Reply-To: References: Message-ID: > This change is mostly the same I sent for review 3 years ago but was > never integrated: > > https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html > > The main difference is that, in the meantime, I submitted a couple of > refactoring changes extracted from the 2019 patch: > > 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr > 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses > > As a result, the current patch is much smaller (but still not small). > > The implementation is otherwise largely the same as in the 2019 > patch. I tried to remove some of the code duplication between the > TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic > shared in template methods. In the 2019 patch, interfaces were trusted > when types were constructed and I had added code to drop interfaces > from a type where they couldn't be trusted. This new patch proceeds > the other way around: interfaces are not trusted when a type is > constructed and code that uses the type must explicitly request that > they are included (this was suggested as an improvement by Vladimir > Ivanov I think). Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision: - review - Merge branch 'master' into JDK-6312651 - review - review - review - review - review - build fix - whitespaces - interfaces ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10901/files - new: https://git.openjdk.org/jdk/pull/10901/files/49d1bf3e..f49a042a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10901&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10901&range=02-03 Stats: 58679 lines in 765 files changed: 18375 ins; 36526 del; 3778 mod Patch: https://git.openjdk.org/jdk/pull/10901.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10901/head:pull/10901 PR: https://git.openjdk.org/jdk/pull/10901 From roland at openjdk.org Thu Nov 10 10:01:49 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 10 Nov 2022 10:01:49 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v3] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 00:14:53 GMT, Vladimir Ivanov wrote: > Much better, thanks! > > Minor comments/suggestions follow. > > FTR test results are clean. I'll submit performance testing shortly. Thanks. I updated the patch and removed the interface specific code in the VM_CLASS_DEFN() macro that's indeed no longer needed. > src/hotspot/share/ci/ciInstanceKlass.cpp line 736: > >> 734: if (_transitive_interfaces == NULL) { >> 735: GUARDED_VM_ENTRY( >> 736: InstanceKlass* ik = get_instanceKlass(); > > A candidate for `compute_transitive_interfaces()` helper method? Done in new commit. 
> src/hotspot/share/opto/subnode.cpp line 1050: > >> 1048: // return the ConP(Foo.klass) >> 1049: assert(mirror_type->is_klass(), "mirror_type should represent a Klass*"); >> 1050: return phase->makecon(TypeKlassPtr::make(mirror_type->as_klass(), Type::trust_interfaces)); > > Extra space. Done. > src/hotspot/share/opto/type.cpp line 4003: > >> 4001: assert(loaded->ptr() != TypePtr::Null, "insanity check"); >> 4002: // >> 4003: if( loaded->ptr() == TypePtr::TopPTR ) { return unloaded; } > > Missing space after `if`. Done. > src/hotspot/share/opto/type.cpp line 4017: > >> 4015: // Both are unloaded, not the same class, not Object >> 4016: // Or meet unloaded with a different loaded class, not java/lang/Object >> 4017: if( ptr != TypePtr::BotPTR ) { > > Missing space after `if`. Done. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From shade at openjdk.org Thu Nov 10 10:25:25 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Nov 2022 10:25:25 GMT Subject: RFR: 8296771: RISC-V: C2: assert(false) failed: bad AD file In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 09:35:52 GMT, Xiaolin Zheng wrote: > The backend encounters the same assertion failure as [JDK-8295414](https://bugs.openjdk.org/browse/JDK-8295414). This patch is a similar fix. > > Same, `partialSubtypeCheck` uses `iRegP_R15` as a result while its use `iRegP` doesn't match the `iRegP_R15`, causing this failure. > > The details are in the JBS issue [JDK-8296771](https://bugs.openjdk.org/browse/JDK-8296771). > > Tested the failed `compiler/types/TestSubTypeCheckMacroTrichotomy.java`, and a hotspot tier1 is running now. > > Thanks, > Xiaolin Looks fine to me. I do wonder if there are similar lurking bugs with other `iRegP_*` that should probably also be part of `operand iRegP`; this limited fix is okay, though, as `riscv.ad` only has `iRegP_R15` and `iRegP_R10` as result registers in existing match rules. ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/11085 From shade at openjdk.org Thu Nov 10 10:46:26 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Nov 2022 10:46:26 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Wed, 9 Nov 2022 20:29:13 GMT, Vladimir Ivanov wrote: >> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: >> >> Do not touch memory at all > > src/hotspot/share/opto/library_call.cpp line 7784: > >> 7782: // side effects like breaking the optimizations across the blackhole. >> 7783: >> 7784: MemBarNode* mb = MemBarNode::make(C, Op_Blackhole); > > One thing to clear if you decide to keep modeling it as `MemBar`: pass `AliasIdxTop` as `alias_idx` . OK, I need to understand why, though. Does passing `AliasIdxTop` sentinel value here protects us from accidentally doing memory merges over the Blackhole node that does not touch memory? Is that the idea, or there is some other reason for it? 
------------- PR: https://git.openjdk.org/jdk/pull/11041 From xlinzheng at openjdk.org Thu Nov 10 11:06:32 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Thu, 10 Nov 2022 11:06:32 GMT Subject: RFR: 8296771: RISC-V: C2: assert(false) failed: bad AD file In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 10:23:12 GMT, Aleksey Shipilev wrote: > Looks Yes, it seems to me as well that only `iRegP_R10` and `iRegP_R15` are result registers after some checking. And thanks for the fast review! ------------- PR: https://git.openjdk.org/jdk/pull/11085 From fyang at openjdk.org Thu Nov 10 12:47:03 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 10 Nov 2022 12:47:03 GMT Subject: RFR: 8296771: RISC-V: C2: assert(false) failed: bad AD file In-Reply-To: References: Message-ID: <7XTK6Z8oaI88ioZvdOoXV1ZttusNDRsS_5y5Z-4Z2Q0=.d03a6338-4e94-460a-a12a-c7314f6a5b8c@github.com> On Thu, 10 Nov 2022 09:35:52 GMT, Xiaolin Zheng wrote: > The backend encounters the same assertion failure as [JDK-8295414](https://bugs.openjdk.org/browse/JDK-8295414). This patch is a similar fix. > > Same, `partialSubtypeCheck` uses `iRegP_R15` as a result while its use `iRegP` doesn't match the `iRegP_R15`, causing this failure. > > The details are in the JBS issue [JDK-8296771](https://bugs.openjdk.org/browse/JDK-8296771). > > Tested the failed `compiler/types/TestSubTypeCheckMacroTrichotomy.java`, and a hotspot tier1 is running now. > > Thanks, > Xiaolin Marked as reviewed by fyang (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11085 From tholenstein at openjdk.org Thu Nov 10 13:19:58 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 10 Nov 2022 13:19:58 GMT Subject: RFR: JDK-8295934: IGV: keep node selection when changing view or graph Message-ID: In IGV nodes can be selected by clicking on it. When a user selects nodes in a certain view, e.g. "Cluster nodes into blocks" view, and then change to e.g. "Sea of nodes" view, the selection should be kept. Same when the user goes a different graph in the same group, the selection should be kept (as long as the nodes are still present) New selection features: - When opening a new graph and no nodes where selected previously the root nodes is selected and centered. selected_root - When a graph in the same group is opened, the previously selected nodes as selected as well if they are present in the graph. The selected nodes are centered in the new graph. - The selected nodes are kept when changing the view, or the properties of the view (e.g. "show neighboring nodes semi-transparent") cluster_view desired - When "show neighboring nodes semi-transparent" is disabled, previously semi-transparent nodes that were selected are now unselected (because they are not visible anymore) - It would also be desired adjust the scroll pane to center the selected nodes when changing view, graph, etc. ------------- Commit messages: - remove redundant scrollRectToVisible() - remove whitespace - showIfHidden flag for addSelectedNodes() - cleanup imports - centerSingleSelectedFigure() - use AnimatorListener for centerSelectedFigures() - updateFigureTexts() - center selected nodes in ControlFlow and Bytecode TopComponent - centerSelectedFigures() - adjust centerRectangle() - ... 
and 16 more: https://git.openjdk.org/jdk/compare/4a0093cc...c365e70a Changes: https://git.openjdk.org/jdk/pull/11062/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11062&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295934 Stats: 703 lines in 20 files changed: 309 ins; 205 del; 189 mod Patch: https://git.openjdk.org/jdk/pull/11062.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11062/head:pull/11062 PR: https://git.openjdk.org/jdk/pull/11062 From tholenstein at openjdk.org Thu Nov 10 13:22:08 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 10 Nov 2022 13:22:08 GMT Subject: RFR: JDK-8296665: IGV: Show dialog with stack trace for exceptions Message-ID: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> Currently in IGV when an exception occurs a small red icon in the bottom right corner appears. The user often does not see this and if he sees it, usually not immediately when the error occurs: exception now The exception reporting level is changed to `1000` (Level.SEVERE) in IGV to show a dialog with the stack-trace. The user can still close it and continue the work: exception suggestion To test invert something like the following somewhere in the codebase try { int i=1/0; } catch (Exception e) { throw new RuntimeException(e); } ------------- Commit messages: - move to .conf file - JDK-8296665: IGV: Show dialog with stack trace for exceptions Changes: https://git.openjdk.org/jdk/pull/11060/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11060&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296665 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11060.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11060/head:pull/11060 PR: https://git.openjdk.org/jdk/pull/11060 From mdoerr at openjdk.org Thu Nov 10 14:11:47 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 10 Nov 2022 14:11:47 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v10] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Add missing include. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/705657d0..1cc1fc21 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=08-09 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From rkennke at openjdk.org Thu Nov 10 15:13:48 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 10 Nov 2022 15:13:48 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() [v2] In-Reply-To: References: Message-ID: > The code in C2_MacroAssembler::fast_unlock() has several (minor) issues: > - The stack-locking path for x86_32 is not under UseHeavyMonitors - it would be executed even when stack-locking is disabled. 
> - The stack-locking paths are the same for x86_32 and x86_64 - they can be merged into a common path. > - In x86_32 path, we call get_thread(boxReg) which is totally bogus because we clear boxReg right afterwards with xorptr(boxReg, boxReg). > - In x86_32 path, the CheckSucc label is identical to the DONE label, and in-fact CheckSucc is only ever really used in the x86_64 path and can be moved there. > > Testing: > - [x] tier1 (x86_32, x86_64) > - [x] tier2 (x86_32, x86_64) Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into JDK-8296170 - 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10936/files - new: https://git.openjdk.org/jdk/pull/10936/files/40817c8f..bc4fd918 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10936&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10936&range=00-01 Stats: 58187 lines in 838 files changed: 17989 ins; 36573 del; 3625 mod Patch: https://git.openjdk.org/jdk/pull/10936.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10936/head:pull/10936 PR: https://git.openjdk.org/jdk/pull/10936 From shade at openjdk.org Thu Nov 10 15:25:28 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Nov 2022 15:25:28 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Wed, 9 Nov 2022 20:27:00 GMT, Vladimir Ivanov wrote: >> I whipped up this patch that pulls `Blackhole` from `MemBarNode` to be the more generic `MultiNode` (can probably even be `Node`, if I understand how should control-output-only nodes be defined): https://cr.openjdk.java.net/~shade/8296545/blackhole-cfg-1.patch -- it seems to "work fine" on adhoc tests. But, I am still a bit uneasy to unhook blackhole from membar, on the off-chance it matters in some non-obvious way. > >> can probably even be Node, if I understand how should control-output-only nodes be defined > > Yes, `MultiNode` is specifically for cases when a node produces multiple results. > > FTR I looked through the code base for `is_MemBar` usages and didn't spot any problematic cases, if you clear away memory input. But I'm fine with keeping it a membar. Well, I don't see how I can make `Blackhole` produce only control. It seems the rest of C2 code frowns upon CFG nodes that are not control projections (or safepoints), see e.g. `is_control_proj_or_safepoint` asserts. Any hints how to proceed here? Maybe an example for such CFG node somewhere? Otherwise, I'd think keeping Blackhole a `MultiNode` and then take the control projection off it -- like in my `blackhole-cfg-1.patch` above -- is the way to do it. 
------------- PR: https://git.openjdk.org/jdk/pull/11041 From duke at openjdk.org Thu Nov 10 16:14:32 2022 From: duke at openjdk.org (Tom Shull) Date: Thu, 10 Nov 2022 16:14:32 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> [v3] In-Reply-To: <8M1GZUYKpvbr4DCJ3139r8D0-njoWu07yWLnP0jLxtU=.8cd5de15-a4c7-4a6a-9705-9979cbc82f60@github.com> References: <6AcveZEfV2AvLEpEP-nSTG3r9aqc_S82tbDatEw1h4s=.f8b29e01-67af-4ff6-9f60-b84264bc724d@github.com> <8M1GZUYKpvbr4DCJ3139r8D0-njoWu07yWLnP0jLxtU=.8cd5de15-a4c7-4a6a-9705-9979cbc82f60@github.com> Message-ID: On Mon, 7 Nov 2022 13:31:15 GMT, Olga Mikhaltsova wrote: >> Olga Mikhaltsova has updated the pull request incrementally with one additional commit since the last revision: >> >> Refactoring > > @teshull could you please take a look! I tried to make fixes according to your comments in #6641 @omikhaltsova Your changes look good to me. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10238 From omikhaltcova at openjdk.org Thu Nov 10 16:28:33 2022 From: omikhaltcova at openjdk.org (Olga Mikhaltsova) Date: Thu, 10 Nov 2022 16:28:33 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> [v3] In-Reply-To: <5V_W5w5OM69N3cKaT5-4AK7YuXS1YPAX5w_sDxfoeeo=.ed4a54db-95b4-4b2d-a3fb-d714725bab96@github.com> References: <6AcveZEfV2AvLEpEP-nSTG3r9aqc_S82tbDatEw1h4s=.f8b29e01-67af-4ff6-9f60-b84264bc724d@github.com> <5V_W5w5OM69N3cKaT5-4AK7YuXS1YPAX5w_sDxfoeeo=.ed4a54db-95b4-4b2d-a3fb-d714725bab96@github.com> Message-ID: On Mon, 7 Nov 2022 16:59:03 GMT, Andrew Haley wrote: >> Olga Mikhaltsova has updated the pull request incrementally with one additional commit since the last revision: >> >> Refactoring > > Looks good. Thanks. @theRealAph @teshull thanks for the review! ------------- PR: https://git.openjdk.org/jdk/pull/10238 From kvn at openjdk.org Thu Nov 10 16:30:26 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Nov 2022 16:30:26 GMT Subject: RFR: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 06:55:07 GMT, Christian Hagedorn wrote: >> When we use -Xcomp we compile `java.lang.invoke.LambdaForm$Kind::` very long linear method for enum class: [LambdaForm.java#L250](https://github.com/openjdk/jdk/blame/master/src/java.base/share/classes/java/lang/invoke/LambdaForm.java#L250) >> >> In addition we inline all class initializers for EA when run with -Xcomp: [bytecodeInfo.cpp#L410](https://github.com/openjdk/jdk/blame/master/src/hotspot/share/opto/bytecodeInfo.cpp#L410) >> >> Recent CDS change [JDK-8293979](https://bugs.openjdk.org/browse/JDK-8293979) allows to inline a bit more deeply too. >> >> Adding `-XX:+AlwaysIncrementalInline` worsened situation even more. >> >> Running with `-XX:+LogCompilation` shows that we hit `NodeCountInliningCutoff (18000)` during `java.lang.invoke.LambdaForm$Kind::` compilation. >> >> In short, we have very long (>40000 live nodes) linear IR graph. `Node::verify_edges()` method process nodes depth-first starting from first input which is control edge. So it is not surprise that depth of this method recursion reached 6000. >> With frame size of 10 words (320 bytes) we easy hit stack overflow (768K in 32-bits debug VM). >> >> I fixed it by using local buffer `Node_List` instead of recursion in `Node::verify_edges()`. >> The algorithm was changed to simplify code. 
It processes inputs in reverse order - last input processed first. And I noticed that maximum use of buffer is only about 1000 or less elements for this compilation (that is why I use live_nodes/16 as initial size of buffer). >> >> Then I did additional experiment with keeping recursion but processing inputs in reverse order: >> >> >> // Recursive walk over all input edges >> - for( i = 0; i < len(); i++ ) { >> - n = in(i); >> + for( i = len(); i > 0; i++ ) { >> + n = in(i - 1); >> if( n != NULL ) >> in(i)->verify_edges(visited); >> } >> >> >> And it shows the same around 1000 stack depth! >> >> I decided to keep my original fix because it should be faster (put only one value on list instead of putting all locals, PC, SP on stack and calls) and much less stack usage. >> >> Testing tier1-3, hs-comp-stress and `TestVerifyGraphEdges.java` test runs with `-XX:+AlwaysIncrementalInline`. > > src/hotspot/share/opto/compile.cpp line 4243: > >> 4241: // Allocate stack of size C->live_nodes()/16 to avoid frequent realloc >> 4242: uint stack_size = live_nodes() >> 4; >> 4243: Node_List nstack(MAX2(stack_size, (uint)OptoNodeListSize)); > > As you only need the stack in `verify_edges()`, I suggest to move these lines directly into the method `verify_edges()`. I need `live_nodes()` value or `stack_size` or `C` to pass for creating list inside method. I decided to move renamed `verify_bidirectional_edges()` method to `Compile` class to get these values inside the method. It does not need to be in `Node` class after I removed recursion. > src/hotspot/share/opto/node.cpp line 2699: > >> 2697: uint length = next->len(); >> 2698: for (uint i = 0; i < length; i++) { >> 2699: Node* n = next->in(i); > > I suggest to rename `n` to `input` to make it easier to see if it is the current node or an input to it. Renamed to `in` ------------- PR: https://git.openjdk.org/jdk/pull/11065 From kvn at openjdk.org Thu Nov 10 16:32:31 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Nov 2022 16:32:31 GMT Subject: RFR: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline In-Reply-To: References: Message-ID: <3_iAic-vfsZXpud-Tw0SNX-ocbqZtA06IQfSJfA-nQU=.80f6d437-5473-40ed-8e45-6491e63ce504@github.com> On Wed, 9 Nov 2022 18:14:43 GMT, Vladimir Kozlov wrote: > When we use -Xcomp we compile `java.lang.invoke.LambdaForm$Kind::` very long linear method for enum class: [LambdaForm.java#L250](https://github.com/openjdk/jdk/blame/master/src/java.base/share/classes/java/lang/invoke/LambdaForm.java#L250) > > In addition we inline all class initializers for EA when run with -Xcomp: [bytecodeInfo.cpp#L410](https://github.com/openjdk/jdk/blame/master/src/hotspot/share/opto/bytecodeInfo.cpp#L410) > > Recent CDS change [JDK-8293979](https://bugs.openjdk.org/browse/JDK-8293979) allows to inline a bit more deeply too. > > Adding `-XX:+AlwaysIncrementalInline` worsened situation even more. > > Running with `-XX:+LogCompilation` shows that we hit `NodeCountInliningCutoff (18000)` during `java.lang.invoke.LambdaForm$Kind::` compilation. > > In short, we have very long (>40000 live nodes) linear IR graph. `Node::verify_edges()` method process nodes depth-first starting from first input which is control edge. So it is not surprise that depth of this method recursion reached 6000. > With frame size of 10 words (320 bytes) we easy hit stack overflow (768K in 32-bits debug VM). > > I fixed it by using local buffer `Node_List` instead of recursion in `Node::verify_edges()`. 
> The algorithm was changed to simplify code. It processes inputs in reverse order - last input processed first. And I noticed that maximum use of buffer is only about 1000 or less elements for this compilation (that is why I use live_nodes/16 as initial size of buffer). > > Then I did additional experiment with keeping recursion but processing inputs in reverse order: > > > // Recursive walk over all input edges > - for( i = 0; i < len(); i++ ) { > - n = in(i); > + for( i = len(); i > 0; i++ ) { > + n = in(i - 1); > if( n != NULL ) > in(i)->verify_edges(visited); > } > > > And it shows the same around 1000 stack depth! > > I decided to keep my original fix because it should be faster (put only one value on list instead of putting all locals, PC, SP on stack and calls) and much less stack usage. > > Testing tier1-3, hs-comp-stress and `TestVerifyGraphEdges.java` test runs with `-XX:+AlwaysIncrementalInline`. Thank you, Aleksey and Christian. I am testing changes requested by Christian. ------------- PR: https://git.openjdk.org/jdk/pull/11065 From roland at openjdk.org Thu Nov 10 17:03:57 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 10 Nov 2022 17:03:57 GMT Subject: RFR: 8296805: ctw build is broken Message-ID: I noticed the build for the ctw tool based on the WhiteBox API is broken. This fixes it AFAICT. ------------- Commit messages: - fix Changes: https://git.openjdk.org/jdk/pull/11090/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11090&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296805 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11090.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11090/head:pull/11090 PR: https://git.openjdk.org/jdk/pull/11090 From kvn at openjdk.org Thu Nov 10 18:10:15 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Nov 2022 18:10:15 GMT Subject: RFR: 8296805: ctw build is broken In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 16:54:39 GMT, Roland Westrelin wrote: > I noticed the build for the ctw tool based on the WhiteBox API is > broken. This fixes it AFAICT. test/hotspot/jtreg/testlibrary/ctw/Makefile line 50: > 48: $(TESTLIBRARY_DIR)/jtreg \ > 49: -maxdepth 1 -name '*.java') > 50: WB_SRC_FILES = $(shell find $(TESTLIBRARY_DIR)/jdk/test/lib/compiler $(TESTLIBRARY_DIR)/jdk/test/whitebox -name '*.java') I think `$(TESTLIBRARY_DIR)/sun/hotspot` should be removed. [JDK-8067223](https://bugs.openjdk.org/browse/JDK-8067223) left copy of WB there because some tests were still using it. But recent @coleenp changes [JDK-8271707](https://bugs.openjdk.org/browse/JDK-8271707) and followed [JDK-8275662](https://bugs.openjdk.org/browse/JDK-8275662) removed duplicated WB in `sun/hotspot`. ------------- PR: https://git.openjdk.org/jdk/pull/11090 From kvn at openjdk.org Thu Nov 10 18:18:23 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Nov 2022 18:18:23 GMT Subject: RFR: 8296805: ctw build is broken In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 18:08:05 GMT, Vladimir Kozlov wrote: >> I noticed the build for the ctw tool based on the WhiteBox API is >> broken. This fixes it AFAICT. > > test/hotspot/jtreg/testlibrary/ctw/Makefile line 50: > >> 48: $(TESTLIBRARY_DIR)/jtreg \ >> 49: -maxdepth 1 -name '*.java') >> 50: WB_SRC_FILES = $(shell find $(TESTLIBRARY_DIR)/jdk/test/lib/compiler $(TESTLIBRARY_DIR)/jdk/test/whitebox -name '*.java') > > I think `$(TESTLIBRARY_DIR)/sun/hotspot` should be removed. 
> [JDK-8067223](https://bugs.openjdk.org/browse/JDK-8067223) left copy of WB there because some tests were still using it. But recent @coleenp changes [JDK-8271707](https://bugs.openjdk.org/browse/JDK-8271707) and followed [JDK-8275662](https://bugs.openjdk.org/browse/JDK-8275662) removed duplicated WB in `sun/hotspot`. NM that. I looked on description in JBS and saw that you need CompilerUtils.java which is in `jdk/test/lib/compiler`. I will run our internal testing. ------------- PR: https://git.openjdk.org/jdk/pull/11090 From kvn at openjdk.org Thu Nov 10 18:28:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Nov 2022 18:28:07 GMT Subject: RFR: 8296805: ctw build is broken In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 18:14:53 GMT, Vladimir Kozlov wrote: >> test/hotspot/jtreg/testlibrary/ctw/Makefile line 50: >> >>> 48: $(TESTLIBRARY_DIR)/jtreg \ >>> 49: -maxdepth 1 -name '*.java') >>> 50: WB_SRC_FILES = $(shell find $(TESTLIBRARY_DIR)/jdk/test/lib/compiler $(TESTLIBRARY_DIR)/jdk/test/whitebox -name '*.java') >> >> I think `$(TESTLIBRARY_DIR)/sun/hotspot` should be removed. >> [JDK-8067223](https://bugs.openjdk.org/browse/JDK-8067223) left copy of WB there because some tests were still using it. But recent @coleenp changes [JDK-8271707](https://bugs.openjdk.org/browse/JDK-8271707) and followed [JDK-8275662](https://bugs.openjdk.org/browse/JDK-8275662) removed duplicated WB in `sun/hotspot`. > > NM that. I looked on description in JBS and saw that you need CompilerUtils.java which is in `jdk/test/lib/compiler`. > I will run our internal testing. I looked and `ClassTransformer.java` (which needs `CompilerUtils.java`) is recent addition: [JDK-8240908](https://bugs.openjdk.org/browse/JDK-8240908) That is why CTW did not need `CompilerUtils.java` before. ------------- PR: https://git.openjdk.org/jdk/pull/11090 From kvn at openjdk.org Thu Nov 10 18:30:18 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Nov 2022 18:30:18 GMT Subject: RFR: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline [v2] In-Reply-To: References: Message-ID: > When we use -Xcomp we compile `java.lang.invoke.LambdaForm$Kind::` very long linear method for enum class: [LambdaForm.java#L250](https://github.com/openjdk/jdk/blame/master/src/java.base/share/classes/java/lang/invoke/LambdaForm.java#L250) > > In addition we inline all class initializers for EA when run with -Xcomp: [bytecodeInfo.cpp#L410](https://github.com/openjdk/jdk/blame/master/src/hotspot/share/opto/bytecodeInfo.cpp#L410) > > Recent CDS change [JDK-8293979](https://bugs.openjdk.org/browse/JDK-8293979) allows to inline a bit more deeply too. > > Adding `-XX:+AlwaysIncrementalInline` worsened situation even more. > > Running with `-XX:+LogCompilation` shows that we hit `NodeCountInliningCutoff (18000)` during `java.lang.invoke.LambdaForm$Kind::` compilation. > > In short, we have very long (>40000 live nodes) linear IR graph. `Node::verify_edges()` method process nodes depth-first starting from first input which is control edge. So it is not surprise that depth of this method recursion reached 6000. > With frame size of 10 words (320 bytes) we easy hit stack overflow (768K in 32-bits debug VM). > > I fixed it by using local buffer `Node_List` instead of recursion in `Node::verify_edges()`. > The algorithm was changed to simplify code. It processes inputs in reverse order - last input processed first. 
And I noticed that maximum use of buffer is only about 1000 or less elements for this compilation (that is why I use live_nodes/16 as initial size of buffer). > > Then I did additional experiment with keeping recursion but processing inputs in reverse order: > > > // Recursive walk over all input edges > - for( i = 0; i < len(); i++ ) { > - n = in(i); > + for( i = len(); i > 0; i++ ) { > + n = in(i - 1); > if( n != NULL ) > in(i)->verify_edges(visited); > } > > > And it shows the same around 1000 stack depth! > > I decided to keep my original fix because it should be faster (put only one value on list instead of putting all locals, PC, SP on stack and calls) and much less stack usage. > > Testing tier1-3, hs-comp-stress and `TestVerifyGraphEdges.java` test runs with `-XX:+AlwaysIncrementalInline`. Vladimir Kozlov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Rename verify_edges() method and move it to Compile class - Merge branch 'master' into JDK-8295867 - remove white space - Merge branch 'master' into JDK-8295867 - 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11065/files - new: https://git.openjdk.org/jdk/pull/11065/files/2978787e..b74d1215 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11065&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11065&range=00-01 Stats: 2504 lines in 212 files changed: 1415 ins; 579 del; 510 mod Patch: https://git.openjdk.org/jdk/pull/11065.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11065/head:pull/11065 PR: https://git.openjdk.org/jdk/pull/11065 From shade at openjdk.org Thu Nov 10 19:24:30 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Nov 2022 19:24:30 GMT Subject: RFR: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline [v2] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 18:30:18 GMT, Vladimir Kozlov wrote: >> When we use -Xcomp we compile `java.lang.invoke.LambdaForm$Kind::` very long linear method for enum class: [LambdaForm.java#L250](https://github.com/openjdk/jdk/blame/master/src/java.base/share/classes/java/lang/invoke/LambdaForm.java#L250) >> >> In addition we inline all class initializers for EA when run with -Xcomp: [bytecodeInfo.cpp#L410](https://github.com/openjdk/jdk/blame/master/src/hotspot/share/opto/bytecodeInfo.cpp#L410) >> >> Recent CDS change [JDK-8293979](https://bugs.openjdk.org/browse/JDK-8293979) allows to inline a bit more deeply too. >> >> Adding `-XX:+AlwaysIncrementalInline` worsened situation even more. >> >> Running with `-XX:+LogCompilation` shows that we hit `NodeCountInliningCutoff (18000)` during `java.lang.invoke.LambdaForm$Kind::` compilation. >> >> In short, we have very long (>40000 live nodes) linear IR graph. `Node::verify_edges()` method process nodes depth-first starting from first input which is control edge. So it is not surprise that depth of this method recursion reached 6000. >> With frame size of 10 words (320 bytes) we easy hit stack overflow (768K in 32-bits debug VM). >> >> I fixed it by using local buffer `Node_List` instead of recursion in `Node::verify_edges()`. >> The algorithm was changed to simplify code. It processes inputs in reverse order - last input processed first. 
And I noticed that maximum use of buffer is only about 1000 or less elements for this compilation (that is why I use live_nodes/16 as initial size of buffer). >> >> Then I did additional experiment with keeping recursion but processing inputs in reverse order: >> >> >> // Recursive walk over all input edges >> - for( i = 0; i < len(); i++ ) { >> - n = in(i); >> + for( i = len(); i > 0; i++ ) { >> + n = in(i - 1); >> if( n != NULL ) >> in(i)->verify_edges(visited); >> } >> >> >> And it shows the same around 1000 stack depth! >> >> I decided to keep my original fix because it should be faster (put only one value on list instead of putting all locals, PC, SP on stack and calls) and much less stack usage. >> >> Testing tier1-3, hs-comp-stress and `TestVerifyGraphEdges.java` test runs with `-XX:+AlwaysIncrementalInline`. > > Vladimir Kozlov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Rename verify_edges() method and move it to Compile class > - Merge branch 'master' into JDK-8295867 > - remove white space > - Merge branch 'master' into JDK-8295867 > - 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline Marked as reviewed by shade (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11065 From kvn at openjdk.org Thu Nov 10 19:38:25 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Nov 2022 19:38:25 GMT Subject: RFR: 8296805: ctw build is broken In-Reply-To: References: Message-ID: <4YkL1GdhkoqSKUN7Levb8ScevrPJFvts0gIgp4RsuEA=.7e9b8fb6-0528-47fc-99da-0691ab7fe01f@github.com> On Thu, 10 Nov 2022 16:54:39 GMT, Roland Westrelin wrote: > I noticed the build for the ctw tool based on the WhiteBox API is > broken. This fixes it AFAICT. I verified that our testing does not use this Makefile. I run `make` locally as you did and it passed with this fix. I consider it is trivial. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11090 From mdoerr at openjdk.org Thu Nov 10 20:09:42 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 10 Nov 2022 20:09:42 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v11] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: We can't retrieve class loader oop during class unloading. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/1cc1fc21..90617b36 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=09-10 Stats: 7 lines in 3 files changed: 2 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From omikhaltcova at openjdk.org Thu Nov 10 21:09:03 2022 From: omikhaltcova at openjdk.org (Olga Mikhaltsova) Date: Thu, 10 Nov 2022 21:09:03 GMT Subject: Integrated: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> In-Reply-To: References: Message-ID: On Mon, 12 Sep 2022 14:35:30 GMT, Olga Mikhaltsova wrote: > This PR is opened as a follow-up for [1] and included the "must-done" fixes pointed by @teshull. > > This patch for JVMCI includes the following fixes related to the macOS AArch64 calling convention: > 1. arguments may consume slots on the stack that are not multiples of 8 bytes [2] > 2. natural alignment of stack arguments [2] > 3. stack must remain 16-byte aligned [3][4] > > Tested with tier1 on macOS AArch64 and Linux AArch64. > > [1] https://github.com/openjdk/jdk/pull/6641 > [2] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms > [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack > [4] https://docs.microsoft.com/en-us/cpp/build/stack-usage?view=msvc-170 This pull request has now been integrated. Changeset: 6b456f7a Author: Olga Mikhaltsova Committer: Anton Kozlov URL: https://git.openjdk.org/jdk/commit/6b456f7a9b6344506033dfdc5a59c0f3e95c4b2a Stats: 112 lines in 8 files changed: 100 ins; 6 del; 6 mod 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> Reviewed-by: aph ------------- PR: https://git.openjdk.org/jdk/pull/10238 From sviswanathan at openjdk.org Thu Nov 10 22:14:36 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 10 Nov 2022 22:14:36 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v11] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 01:22:04 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 
1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > fix windows and 32b linux builds src/hotspot/share/opto/library_call.cpp line 6981: > 6979: > 6980: if (!stubAddr) return false; > 6981: Node* polyObj = argument(0); Minor cleanup: This could be removed as it is not used. src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 28: > 26: package com.sun.crypto.provider; > 27: > 28: import java.lang.reflect.Field; Minor cleanup: This could be removed. src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 249: > 247: @ForceInline > 248: @IntrinsicCandidate > 249: private void processMultipleBlocks(byte[] input, int offset, int length, long[] aLimbs, long[] rLimbs) { A comment here to indicate aLimbs and rLimbs are part of a and r and used in intrinsic. src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 253: > 251: n.setValue(input, offset, BLOCK_LENGTH, (byte)0x01); > 252: a.setSum(n); // A += (temp | 0x01) > 253: a.setProduct(r); // A = (A * R) % p Comment needs update to match code. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Thu Nov 10 22:48:37 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 10 Nov 2022 22:48:37 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v12] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 
56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: Sandhya's review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/abfc68f4..2176caf8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=10-11 Stats: 6 lines in 2 files changed: 2 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Thu Nov 10 22:48:38 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 10 Nov 2022 22:48:38 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v11] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 22:03:24 GMT, Sandhya Viswanathan wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> fix windows and 32b linux builds > > src/hotspot/share/opto/library_call.cpp line 6981: > >> 6979: >> 6980: if (!stubAddr) return false; >> 6981: Node* polyObj = argument(0); > > Minor cleanup: This could be removed as it is not used. done > src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 28: > >> 26: package com.sun.crypto.provider; >> 27: >> 28: import java.lang.reflect.Field; > > Minor cleanup: This could be removed. done > src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 249: > >> 247: @ForceInline >> 248: @IntrinsicCandidate >> 249: private void processMultipleBlocks(byte[] input, int offset, int length, long[] aLimbs, long[] rLimbs) { > > A comment here to indicate aLimbs and rLimbs are part of a and r and used in intrinsic. done > src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 253: > >> 251: n.setValue(input, offset, BLOCK_LENGTH, (byte)0x01); >> 252: a.setSum(n); // A += (temp | 0x01) >> 253: a.setProduct(r); // A = (A * R) % p > > Comment needs update to match code. done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Thu Nov 10 22:59:52 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 10 Nov 2022 22:59:52 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v13] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 
1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: jcheck ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/2176caf8..196ee35b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=11-12 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From kvn at openjdk.org Thu Nov 10 23:59:49 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Nov 2022 23:59:49 GMT Subject: RFR: 8296824: ProblemList compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java Message-ID: Put `NativeCallTest.java` JVMCI test on problem list until [JDK-8296821](https://bugs.openjdk.org/browse/JDK-8296821) is fixed. ------------- Commit messages: - 8296824: ProblemList compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java Changes: https://git.openjdk.org/jdk/pull/11096/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11096&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296824 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11096.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11096/head:pull/11096 PR: https://git.openjdk.org/jdk/pull/11096 From dcubed at openjdk.org Thu Nov 10 23:59:49 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Thu, 10 Nov 2022 23:59:49 GMT Subject: RFR: 8296824: ProblemList compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 23:45:46 GMT, Vladimir Kozlov wrote: > Put `NativeCallTest.java` JVMCI test on problem list until [JDK-8296821](https://bugs.openjdk.org/browse/JDK-8296821) is fixed. Thumbs up. This is a trivial fix. ------------- Marked as reviewed by dcubed (Reviewer). PR: https://git.openjdk.org/jdk/pull/11096 From kvn at openjdk.org Fri Nov 11 00:01:37 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 11 Nov 2022 00:01:37 GMT Subject: Integrated: 8296824: ProblemList compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 23:45:46 GMT, Vladimir Kozlov wrote: > Put `NativeCallTest.java` JVMCI test on problem list until [JDK-8296821](https://bugs.openjdk.org/browse/JDK-8296821) is fixed. This pull request has now been integrated. 
Changeset: 2f9a94f4 Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/2f9a94f41c1b5ea38efa8ee6dd71f0b6db401028 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8296824: ProblemList compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java Reviewed-by: dcubed ------------- PR: https://git.openjdk.org/jdk/pull/11096 From vlivanov at openjdk.org Fri Nov 11 00:04:38 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 00:04:38 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v4] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 10:01:48 GMT, Roland Westrelin wrote: >> This change is mostly the same I sent for review 3 years ago but was >> never integrated: >> >> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html >> >> The main difference is that, in the meantime, I submitted a couple of >> refactoring changes extracted from the 2019 patch: >> >> 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr >> 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses >> >> As a result, the current patch is much smaller (but still not small). >> >> The implementation is otherwise largely the same as in the 2019 >> patch. I tried to remove some of the code duplication between the >> TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic >> shared in template methods. In the 2019 patch, interfaces were trusted >> when types were constructed and I had added code to drop interfaces >> from a type where they couldn't be trusted. This new patch proceeds >> the other way around: interfaces are not trusted when a type is >> constructed and code that uses the type must explicitly request that >> they are included (this was suggested as an improvement by Vladimir >> Ivanov I think). > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-6312651 > - review > - review > - review > - review > - review > - build fix > - whitespaces > - interfaces FTR performance results look good. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From vlivanov at openjdk.org Fri Nov 11 00:04:39 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 00:04:39 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v3] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 08:39:38 GMT, Roland Westrelin wrote: >> src/hotspot/share/ci/ciArrayKlass.cpp line 108: >> >>> 106: } >>> 107: >>> 108: GrowableArray* ciArrayKlass::interfaces() { >> >> FTR there's a subtle asymmetry between `ciArrayKlass::interfaces()` and `ciInstanceKlass::transitive_interfaces()` when it comes to memory allocation: the former allocates from resource area while the latter from compiler arena. >> >> It doesn't cause any problems since `ciArrayKlass::interfaces()` is used only in `Type::Initialize_shared()` to instantiate shared `TypeAryPtr::_array_interfaces`, but it took me some time to find that out. >> >> Maybe a helper method in `type.cpp` is a better place for that logic. > > How would that work? ciArrayKlass::interfaces() has to transition into the vm which is not something that would feel right in type.cpp. 
Indeed, good point. IMO you could just work directly with CI mirrors for `Serializable` and `Cloneable`, but now I'm curious why does CI code diverge from `ObjArrayKlass::compute_secondary_supers()`? https://github.com/openjdk/jdk/blob/master/src/hotspot/share/oops/objArrayKlass.cpp#L376 ------------- PR: https://git.openjdk.org/jdk/pull/10901 From vlivanov at openjdk.org Fri Nov 11 00:18:30 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 00:18:30 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v3] In-Reply-To: References: Message-ID: <3sO3Hd3tOq5MWO_0EZO2pegdPa-QAbv13bG3maAI0Ag=.9bde5da6-c61e-430c-8bcb-a4fd7b88bb80@github.com> On Thu, 10 Nov 2022 23:58:54 GMT, Vladimir Ivanov wrote: >> How would that work? ciArrayKlass::interfaces() has to transition into the vm which is not something that would feel right in type.cpp. > > Indeed, good point. > > IMO you could just work directly with CI mirrors for `Serializable` and `Cloneable`, but now I'm curious why does CI code diverge from `ObjArrayKlass::compute_secondary_supers()`? > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/oops/objArrayKlass.cpp#L376 And to answer my question: only `Serializable` and `Cloneable` are interfaces among those. The rest are arrays. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From vlivanov at openjdk.org Fri Nov 11 00:47:31 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 00:47:31 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Thu, 10 Nov 2022 10:42:30 GMT, Aleksey Shipilev wrote: >> src/hotspot/share/opto/library_call.cpp line 7784: >> >>> 7782: // side effects like breaking the optimizations across the blackhole. >>> 7783: >>> 7784: MemBarNode* mb = MemBarNode::make(C, Op_Blackhole); >> >> One thing to clear if you decide to keep modeling it as `MemBar`: pass `AliasIdxTop` as `alias_idx` . > > OK, I need to understand why, though. Does passing `AliasIdxTop` sentinel value here protects us from accidentally doing memory merges over the Blackhole node that does not touch memory? Is that the idea, or there is some other reason for it? I'm not sure whether it causes any problems or not (since the node is completely disconnected from the memory graph), but it is just weird to have a memory node consuming `TOP` and reporting `TypePtr::BOTTOM` as `adr_type` (which alias with everything). I won't be surprised if it eventually breaks somewhere in `MemBar`-specific code. ------------- PR: https://git.openjdk.org/jdk/pull/11041 From vlivanov at openjdk.org Fri Nov 11 01:06:31 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 01:06:31 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v2] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: <4b-2Gfc6NLLmAysjDqBtvLAuxksKkmU5ZP5VDBb_2NQ=.548c9041-b4c9-469a-9c0e-58a9295d4dda@github.com> On Thu, 10 Nov 2022 15:21:32 GMT, Aleksey Shipilev wrote: >>> can probably even be Node, if I understand how should control-output-only nodes be defined >> >> Yes, `MultiNode` is specifically for cases when a node produces multiple results. >> >> FTR I looked through the code base for `is_MemBar` usages and didn't spot any problematic cases, if you clear away memory input. 
But I'm fine with keeping it a membar. > > Well, I don't see how I can make `Blackhole` produce only control. It seems the rest of C2 code frowns upon CFG nodes that are not control projections (or safepoints), see e.g. `is_control_proj_or_safepoint` asserts. Any hints how to proceed here? Maybe an example for such CFG node somewhere? > > Otherwise, I'd think keeping Blackhole a `MultiNode` and then take the control projection off it -- like in my `blackhole-cfg-1.patch` above -- is the way to do it. Yeah, I agree that your `blackhole-cfg-1.patch` is the lowest friction way to achieve the goal. There are some pure control nodes (e.g., `Region`), but they are treated specially during code motion. So, special cases for `Blackhole` would be needed there. ------------- PR: https://git.openjdk.org/jdk/pull/11041 From vlivanov at openjdk.org Fri Nov 11 01:07:42 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 01:07:42 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v4] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 10:01:48 GMT, Roland Westrelin wrote: >> This change is mostly the same I sent for review 3 years ago but was >> never integrated: >> >> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html >> >> The main difference is that, in the meantime, I submitted a couple of >> refactoring changes extracted from the 2019 patch: >> >> 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr >> 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses >> >> As a result, the current patch is much smaller (but still not small). >> >> The implementation is otherwise largely the same as in the 2019 >> patch. I tried to remove some of the code duplication between the >> TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic >> shared in template methods. In the 2019 patch, interfaces were trusted >> when types were constructed and I had added code to drop interfaces >> from a type where they couldn't be trusted. This new patch proceeds >> the other way around: interfaces are not trusted when a type is >> constructed and code that uses the type must explicitly request that >> they are included (this was suggested as an improvement by Vladimir >> Ivanov I think). > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-6312651 > - review > - review > - review > - review > - review > - build fix > - whitespaces > - interfaces Great work, Roland! I'm approving the PR. (hs-tier1 - hs-tier2 sanity testing passed with latest version.) Feel free to handle `ciArrayKlass::interfaces()` as you find most appropriate. ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/10901 From duke at openjdk.org Fri Nov 11 01:14:05 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 11 Nov 2022 01:14:05 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. 
> > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: live review with Sandhya ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/196ee35b..835fbe3a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=12-13 Stats: 32 lines in 3 files changed: 17 ins; 5 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From sviswanathan at openjdk.org Fri Nov 11 01:15:50 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 11 Nov 2022 01:15:50 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 01:14:05 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 
154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > live review with Sandhya Marked as reviewed by sviswanathan (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10582 From sviswanathan at openjdk.org Fri Nov 11 01:21:42 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 11 Nov 2022 01:21:42 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 01:14:05 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > live review with Sandhya The PR looks good to me. @ascarpino Please let us know if the Java side changes look good to you. @iwanowww Please let us know if the compiler side changes look good to you. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From kbarrett at openjdk.org Fri Nov 11 01:44:28 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 11 Nov 2022 01:44:28 GMT Subject: RFR: 8296349: [aarch64] Avoid slicing Address::extend In-Reply-To: References: Message-ID: <73b5QjC78nBjirQMQRcRjMX-gega04t56oJ556mlFXM=.5d471275-79a6-43f6-a603-0db3179300c7@github.com> On Fri, 4 Nov 2022 03:00:54 GMT, Kim Barrett wrote: > Please review this change around `Address::extend`. The 4 derived classes are > replaced by static functions of the same name as the former class. These > functions return an `extend` object initialized with the same values as were > used by the corresponding derived class constructor. > > Testing: mach5 tier1-3 Anyone else? 
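To make the shape of the change concrete for anyone skimming the thread, the idea is roughly the following (a simplified, hypothetical sketch with placeholder encodings, not the actual aarch64 `Address` code):

// Before: tiny derived classes whose only job is to pick constructor arguments.
// Passing such an object around by value as 'extend' slices off the derived part.
struct extend {
  int shift;
  int option;   // placeholder for the addressing-mode encoding
  extend(int s, int o) : shift(s), option(o) {}
};
struct uxtw : public extend {
  explicit uxtw(int s) : extend(s, /* placeholder encoding */ 2) {}
};

// After: static factory functions of the same name return a fully initialized
// base object, so there is no derived type left to slice.
struct extend_flat {
  int shift;
  int option;
  static extend_flat uxtw(int s) { return extend_flat{s, /* placeholder */ 2}; }
  static extend_flat sxtw(int s) { return extend_flat{s, /* placeholder */ 6}; }
};

Call sites simply switch from constructing a `uxtw(shift)` object to calling the `uxtw(shift)` factory; the stored values stay the same.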
------------- PR: https://git.openjdk.org/jdk/pull/10976 From vlivanov at openjdk.org Fri Nov 11 01:47:41 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 01:47:41 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 01:14:05 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > live review with Sandhya Overall, it looks good. src/hotspot/cpu/x86/macroAssembler_x86.hpp line 733: > 731: void andptr(Register src1, Register src2) { LP64_ONLY(andq(src1, src2)) NOT_LP64(andl(src1, src2)) ; } > 732: > 733: #ifdef _LP64 Why is it x64-specific? src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 161: > 159: const XMMRegister P2_H = xmm5; > 160: const XMMRegister TMP1 = xmm6; > 161: const Register polyCP = r13; Could be renamed to `rscratch` (or `tmp`) since it doesn't hold constant base address anymore. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Fri Nov 11 01:47:41 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 01:47:41 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v11] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 22:41:31 GMT, Volodymyr Paprotski wrote: >> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 249: >> >>> 247: @ForceInline >>> 248: @IntrinsicCandidate >>> 249: private void processMultipleBlocks(byte[] input, int offset, int length, long[] aLimbs, long[] rLimbs) { >> >> A comment here to indicate aLimbs and rLimbs are part of a and r and used in intrinsic. > > done Overall, it looks weird to see aLimbs/rLimbs being unused, but I see why it is so. If security folks are fine with that, I'm OK with it as well. 
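For anyone reading along, the shape under discussion is roughly this (a simplified sketch of the pattern, not the exact Poly1305.java contents; it relies on the class fields `a`, `r`, `n` and the `BLOCK_LENGTH` constant mentioned above):

@ForceInline
@IntrinsicCandidate
private void processMultipleBlocks(byte[] input, int offset, int length, long[] aLimbs, long[] rLimbs) {
    // aLimbs/rLimbs are only read by the HotSpot intrinsic; the Java fallback
    // below recomputes the accumulator from the field state instead.
    while (length >= BLOCK_LENGTH) {
        n.setValue(input, offset, BLOCK_LENGTH, (byte)0x01);
        a.setSum(n);      // A += (temp | 0x01)
        a.setProduct(r);  // A = (A * R) % p
        offset += BLOCK_LENGTH;
        length -= BLOCK_LENGTH;
    }
}

So the limb arrays are effectively an ABI between the Java code and the intrinsic, which is why they look unused on the Java side.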
------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Fri Nov 11 01:47:42 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 01:47:42 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v13] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 22:59:52 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > jcheck src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 252: > 250: private void processMultipleBlocks(byte[] input, int offset, int length, long[] aLimbs, long[] rLimbs) { > 251: while (length >= BLOCK_LENGTH) { > 252: n.setValue(input, offset, BLOCK_LENGTH, (byte)0x01); You could call `processBlock(input, offset, BLOCK_LENGTH);` here. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Fri Nov 11 01:58:34 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 01:58:34 GMT Subject: RFR: 8289943: Simplify some object allocation merges [v13] In-Reply-To: References: Message-ID: On Mon, 7 Nov 2022 02:03:08 GMT, Cesar Soares Lucas wrote: > Does it look like a better approach? It definitely does! Thanks a lot for thinking it through. > LoadNode::split_through_phi requires is_known_instance_field and therefore can't be run before split_unique_types without changes. IMO it should be fine to customize `LoadNode::split_through_phi` for our purposes. > So, we have to decide if it's best to re-use the code in PhaseMacroExpand::scalar_replacement in adjust_scalar_replaceable_state or if we want to add new code to create SSON for merge Phis in PhaseMacroExpand::scalar_replacement. >From design perspective, it's cleaner to modify the IR during `PhaseMacroExpand::scalar_replacement()`, since `ConnectionGraph::adjust_scalar_replaceable_state()` operates solely on `ConnectionGraph` instance being built. 
> Cons: the logic to split merge phi is spread throughout escape analysis and scalar replacement. It would be helpful to see an implementation sketch to get better understanding how much complexity it adds. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From xlinzheng at openjdk.org Fri Nov 11 03:03:44 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 11 Nov 2022 03:03:44 GMT Subject: RFR: 8296771: RISC-V: C2: assert(false) failed: bad AD file In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 09:35:52 GMT, Xiaolin Zheng wrote: > The backend encounters the same assertion failure as [JDK-8295414](https://bugs.openjdk.org/browse/JDK-8295414). This patch is a similar fix. > > Same, `partialSubtypeCheck` uses `iRegP_R15` as a result while its use `iRegP` doesn't match the `iRegP_R15`, causing this failure. > > The details are in the JBS issue [JDK-8296771](https://bugs.openjdk.org/browse/JDK-8296771). > > Tested the failed `compiler/types/TestSubTypeCheckMacroTrichotomy.java`, and a hotspot tier1 is running now. > > Thanks, > Xiaolin Tier1 says okay. Thanks for reviewing! ------------- PR: https://git.openjdk.org/jdk/pull/11085 From thartmann at openjdk.org Fri Nov 11 07:04:41 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 07:04:41 GMT Subject: RFR: 8296805: ctw build is broken In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 16:54:39 GMT, Roland Westrelin wrote: > I noticed the build for the ctw tool based on the WhiteBox API is > broken. This fixes it AFAICT. Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11090 From thartmann at openjdk.org Fri Nov 11 07:48:37 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 07:48:37 GMT Subject: RFR: 8294217: Assertion failure: parsing found no loops but there are some [v2] In-Reply-To: References: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> Message-ID: On Tue, 8 Nov 2022 14:06:38 GMT, Roland Westrelin wrote: >> This was reported on 11 and is not reproducible with the current >> jdk. The reason is that the PhaseIdealLoop invocation before EA was >> changed from LoopOptsNone to LoopOptsMaxUnroll. In the absence of >> loops, LoopOptsMaxUnroll exits earlier than LoopOptsNone. That wasn't >> intended and this patch makes sure they behave the same. Once that's >> changed, the crash reproduces with the current jdk. >> >> The assert fires because PhaseIdealLoop::only_has_infinite_loops() >> returns false even though the IR only has infinite loops. There's a >> single loop nest and the inner most loop is an infinite loop. The >> current logic only looks at loops that are direct children of the root >> of the loop tree. It's not the first bug where >> PhaseIdealLoop::only_has_infinite_loops() fails to catch an infinite >> loop (8257574 was the previous one) and it's proving challenging to >> have PhaseIdealLoop::only_has_infinite_loops() handle corner cases >> robustly. I reworked PhaseIdealLoop::only_has_infinite_loops() once >> more. This time it goes over all children of the root of the loop >> tree, collects all controls for the loop and its inner loop. It then >> checks whether any control is a branch out of the loop and if it is >> whether it's not a NeverBranch. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Marked as reviewed by thartmann (Reviewer). 
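For readers trying to picture the reworked only_has_infinite_loops() check described above, here is a toy model of the idea (a self-contained sketch with stand-in types, not the actual PhaseIdealLoop code):

#include <set>
#include <vector>

struct Ctrl { bool is_never_branch; std::vector<Ctrl*> successors; };     // stand-in for a control node
struct Loop { std::vector<Ctrl*> controls; std::vector<Loop*> children; }; // stand-in for a loop-tree entry

// Collect the controls of a loop and of all loops nested inside it.
static void collect(const Loop* l, std::set<Ctrl*>& out) {
  for (Ctrl* c : l->controls) out.insert(c);
  for (const Loop* child : l->children) collect(child, out);
}

// True only if no loop reachable from the root of the loop tree has a real exit:
// the only allowed way out is a NeverBranch.
bool only_has_infinite_loops(const std::vector<Loop*>& top_level_loops) {
  for (const Loop* l : top_level_loops) {
    std::set<Ctrl*> controls;
    collect(l, controls);
    for (Ctrl* c : controls) {
      for (Ctrl* succ : c->successors) {
        if (controls.count(succ) == 0 && !c->is_never_branch) {
          return false;  // a branch out of the loop that is not a NeverBranch
        }
      }
    }
  }
  return true;
}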
------------- PR: https://git.openjdk.org/jdk/pull/10904 From thartmann at openjdk.org Fri Nov 11 07:54:31 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 07:54:31 GMT Subject: RFR: 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 09:58:42 GMT, Christian Hagedorn wrote: > There are currently two problems with `IRNode.ALLOC*` regexes: > 1. On PPC64, we do not account for an `LI` instruction which matches the array size. As a result, we could miss some array allocations with the `ALLOC_ARRAY*` regexes: > > 2e4 LD R3, offset, R3 // load ptr precise [java/lang/Object: > 0x0000200058006e40 *: :Constant:exact * from TOC (lo) > 2e8 STD R17, [R1_SP + #104+0] // spill copy > 2ec LI R4, #1 <------- we only look for LGHI here which is specific to s390 while LI is used for PPC64 > 2f0 CALL,static 0x00002000177cd300 // ==> wrapper for: _new_array_Java > > This was revealed by a new test added by [JDK-8280378](https://bugs.openjdk.org/browse/JDK-8280378) but was already a problem before this change. > > 2. The newly added `IRNode.ALLOC*` regexes in JDK-8280378 which can be matched on the independent ideal compile phases by using the name of the IR node "Allocate" also matches "AllocateArray" (substring match). This is unexpected. I've changed this by matching "Allocate" exactly. > > I've additionally removed the matching of `LI` and `LGHI` for the `ALLOC` regexes on normal objects as we do not have an array size. I think it's safe to remove these (might need some additional testing on PPC64/s390). > > Thanks @TheRealMDoerr for helping to test the initial fix on PPC64! > > Thanks, > Christian Looks good to me. test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 248: > 246: obj2 = new Object[1]; > 247: } > 248: @Test I think a newline before `@Test` would be good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11037 From thartmann at openjdk.org Fri Nov 11 07:58:29 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 07:58:29 GMT Subject: RFR: 8296349: [aarch64] Avoid slicing Address::extend In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 03:00:54 GMT, Kim Barrett wrote: > Please review this change around `Address::extend`. The 4 derived classes are > replaced by static functions of the same name as the former class. These > functions return an `extend` object initialized with the same values as were > used by the corresponding derived class constructor. > > Testing: mach5 tier1-3 Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10976 From xlinzheng at openjdk.org Fri Nov 11 08:06:39 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 11 Nov 2022 08:06:39 GMT Subject: Integrated: 8296771: RISC-V: C2: assert(false) failed: bad AD file In-Reply-To: References: Message-ID: <9EG7H0u0e_quPsUE5SLnkzRIegksoN-f4SRcao2Bxvo=.d6f7fb1d-5b82-4f63-86f0-5521f2f89782@github.com> On Thu, 10 Nov 2022 09:35:52 GMT, Xiaolin Zheng wrote: > The backend encounters the same assertion failure as [JDK-8295414](https://bugs.openjdk.org/browse/JDK-8295414). This patch is a similar fix. > > Same, `partialSubtypeCheck` uses `iRegP_R15` as a result while its use `iRegP` doesn't match the `iRegP_R15`, causing this failure. > > The details are in the JBS issue [JDK-8296771](https://bugs.openjdk.org/browse/JDK-8296771). 
> > Tested the failed `compiler/types/TestSubTypeCheckMacroTrichotomy.java`, and a hotspot tier1 is running now. > > Thanks, > Xiaolin This pull request has now been integrated. Changeset: 7244eac9 Author: Xiaolin Zheng Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/7244eac9dfe4e7e9c3eea613149f0fb1390f00aa Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8296771: RISC-V: C2: assert(false) failed: bad AD file Reviewed-by: shade, fyang ------------- PR: https://git.openjdk.org/jdk/pull/11085 From kbarrett at openjdk.org Fri Nov 11 08:36:49 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 11 Nov 2022 08:36:49 GMT Subject: RFR: 8296349: [aarch64] Avoid slicing Address::extend [v2] In-Reply-To: References: Message-ID: > Please review this change around `Address::extend`. The 4 derived classes are > replaced by static functions of the same name as the former class. These > functions return an `extend` object initialized with the same values as were > used by the corresponding derived class constructor. > > Testing: mach5 tier1-3 Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into flatten-extend - flatten Address::extend ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10976/files - new: https://git.openjdk.org/jdk/pull/10976/files/a351cd6a..77ac8cdd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10976&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10976&range=00-01 Stats: 14959 lines in 520 files changed: 6697 ins; 6140 del; 2122 mod Patch: https://git.openjdk.org/jdk/pull/10976.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10976/head:pull/10976 PR: https://git.openjdk.org/jdk/pull/10976 From kbarrett at openjdk.org Fri Nov 11 08:36:49 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 11 Nov 2022 08:36:49 GMT Subject: RFR: 8296349: [aarch64] Avoid slicing Address::extend [v2] In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 08:43:53 GMT, Andrew Haley wrote: >> Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into flatten-extend >> - flatten Address::extend > > Yep. That's a nice simplification, thanks. Thanks @theRealAph and @TobiHartmann for reviews. ------------- PR: https://git.openjdk.org/jdk/pull/10976 From kbarrett at openjdk.org Fri Nov 11 08:38:35 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 11 Nov 2022 08:38:35 GMT Subject: Integrated: 8296349: [aarch64] Avoid slicing Address::extend In-Reply-To: References: Message-ID: <80gHwwCkjjZNv_F35wm02mlMzxAGL-6p1-OBMhJW63s=.d45b3096-4545-48cd-bb54-1208d8bf7382@github.com> On Fri, 4 Nov 2022 03:00:54 GMT, Kim Barrett wrote: > Please review this change around `Address::extend`. The 4 derived classes are > replaced by static functions of the same name as the former class. These > functions return an `extend` object initialized with the same values as were > used by the corresponding derived class constructor. > > Testing: mach5 tier1-3 This pull request has now been integrated. 
Changeset: 12e76cbc Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/12e76cbc725ff87577e2ef23267590eae37a82d1 Stats: 20 lines in 2 files changed: 0 ins; 14 del; 6 mod 8296349: [aarch64] Avoid slicing Address::extend Reviewed-by: aph, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10976 From thartmann at openjdk.org Fri Nov 11 08:46:29 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 08:46:29 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 02:36:13 GMT, Fei Gao wrote: > For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. > > We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether > `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 Looks good to me otherwise but someone more familiar with superword should look at this as well. test/hotspot/jtreg/compiler/loopopts/TestUnsupportedConditionalMove.java line 28: > 26: * @bug 8295407 > 27: * @summary C2 crash: Error: ShouldNotReachHere() in multiple vector tests with > 28: * -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast `MonomorphicArrayCheck` and `UncommonNullCast` are debug flags, the test will fail with a release build. ------------- PR: https://git.openjdk.org/jdk/pull/11034 From thartmann at openjdk.org Fri Nov 11 09:12:37 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 09:12:37 GMT Subject: RFR: JDK-8295934: IGV: keep node selection when changing view or graph In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 14:30:48 GMT, Tobias Holenstein wrote: > In IGV nodes can be selected by clicking on it. When a user selects nodes in a certain view, e.g. "Cluster nodes into blocks" view, and then change to e.g. 
"Sea of nodes" view, the selection should be kept. Same when the user goes a different graph in the same group, the selection should be kept (as long as the nodes are still present) > > New selection features: > - When opening a new graph and no nodes where selected previously the root nodes is selected and centered. > selected_root > > - When a graph in the same group is opened, the previously selected nodes as selected as well if they are present in the graph. The selected nodes are centered in the new graph. > - The selected nodes are kept when changing the view, or the properties of the view (e.g. "show neighboring nodes semi-transparent") > cluster_view > desired > > - When "show neighboring nodes semi-transparent" is disabled, previously semi-transparent nodes that were selected are now unselected (because they are not visible anymore) > - It would also be desired adjust the scroll pane to center the selected nodes when changing view, graph, etc. Looks good to me functionality-wise. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11062 From dsamersoff at openjdk.org Fri Nov 11 09:21:07 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Fri, 11 Nov 2022 09:21:07 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: References: Message-ID: <6lrtKJYgowdr9MsRCcUNvj0_zZJliP4gk_GC0wLiNzo=.699c8627-7f22-4a74-bd71-7de01a60443a@github.com> On Wed, 9 Nov 2022 17:36:05 GMT, Vladimir Kozlov wrote: >> In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. >> >> The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. > > src/hotspot/cpu/x86/nativeInst_x86.cpp line 532: > >> 530: >> 531: #else >> 532: unsigned char code_buffer[5]; > > Should this be aligned? I would prefer to keep original 32bit code, that is here for ages, as it is. Verified entry point is always aligned, so alignment shouldn't be a problem. > src/hotspot/cpu/x86/nativeInst_x86.cpp line 562: > >> 560: >> 561: // Patch bytes 0-3 (from jump instruction) >> 562: *(int32_t*)verified_entry = *(int32_t *)code_buffer; > > Is this store and at line 552 atomic? This code is also inherited. On x86 pointer sized stores is atomic, I used Atomic::store in 64bit code above just to improve readability. ------------- PR: https://git.openjdk.org/jdk/pull/11059 From thartmann at openjdk.org Fri Nov 11 09:26:36 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 09:26:36 GMT Subject: RFR: JDK-8296665: IGV: Show dialog with stack trace for exceptions In-Reply-To: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> References: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> Message-ID: On Wed, 9 Nov 2022 13:10:54 GMT, Tobias Holenstein wrote: > Currently in IGV when an exception occurs a small red icon in the bottom right corner appears. The user often does not see this and if he sees it, usually not immediately when the error occurs: > exception now > > The exception reporting level is changed to `1000` (Level.SEVERE) in IGV to show a dialog with the stack-trace. 
The user can still close it and continue the work: > exception suggestion > > To test invert something like the following somewhere in the codebase > > try { > int i=1/0; > } catch (Exception e) { > throw new RuntimeException(e); > } Looks good to me. If we start hitting too many exceptions, we might need to (temporarily) disable this again. Just wondering, isn't there a config file or something to put such Netbeans specific settings instead of passing them as `-J-D` args to the JVM? ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11060 From chagedorn at openjdk.org Fri Nov 11 09:42:22 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Nov 2022 09:42:22 GMT Subject: RFR: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline [v2] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 18:30:18 GMT, Vladimir Kozlov wrote: >> When we use -Xcomp we compile `java.lang.invoke.LambdaForm$Kind::` very long linear method for enum class: [LambdaForm.java#L250](https://github.com/openjdk/jdk/blame/master/src/java.base/share/classes/java/lang/invoke/LambdaForm.java#L250) >> >> In addition we inline all class initializers for EA when run with -Xcomp: [bytecodeInfo.cpp#L410](https://github.com/openjdk/jdk/blame/master/src/hotspot/share/opto/bytecodeInfo.cpp#L410) >> >> Recent CDS change [JDK-8293979](https://bugs.openjdk.org/browse/JDK-8293979) allows to inline a bit more deeply too. >> >> Adding `-XX:+AlwaysIncrementalInline` worsened situation even more. >> >> Running with `-XX:+LogCompilation` shows that we hit `NodeCountInliningCutoff (18000)` during `java.lang.invoke.LambdaForm$Kind::` compilation. >> >> In short, we have very long (>40000 live nodes) linear IR graph. `Node::verify_edges()` method process nodes depth-first starting from first input which is control edge. So it is not surprise that depth of this method recursion reached 6000. >> With frame size of 10 words (320 bytes) we easy hit stack overflow (768K in 32-bits debug VM). >> >> I fixed it by using local buffer `Node_List` instead of recursion in `Node::verify_edges()`. >> The algorithm was changed to simplify code. It processes inputs in reverse order - last input processed first. And I noticed that maximum use of buffer is only about 1000 or less elements for this compilation (that is why I use live_nodes/16 as initial size of buffer). >> >> Then I did additional experiment with keeping recursion but processing inputs in reverse order: >> >> >> // Recursive walk over all input edges >> - for( i = 0; i < len(); i++ ) { >> - n = in(i); >> + for( i = len(); i > 0; i++ ) { >> + n = in(i - 1); >> if( n != NULL ) >> in(i)->verify_edges(visited); >> } >> >> >> And it shows the same around 1000 stack depth! >> >> I decided to keep my original fix because it should be faster (put only one value on list instead of putting all locals, PC, SP on stack and calls) and much less stack usage. >> >> Testing tier1-3, hs-comp-stress and `TestVerifyGraphEdges.java` test runs with `-XX:+AlwaysIncrementalInline`. > > Vladimir Kozlov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: > > - Rename verify_edges() method and move it to Compile class > - Merge branch 'master' into JDK-8295867 > - remove white space > - Merge branch 'master' into JDK-8295867 > - 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline Thanks for doing the updates, looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11065 From chagedorn at openjdk.org Fri Nov 11 09:42:24 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Nov 2022 09:42:24 GMT Subject: RFR: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline [v2] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 16:26:13 GMT, Vladimir Kozlov wrote: >> src/hotspot/share/opto/compile.cpp line 4243: >> >>> 4241: // Allocate stack of size C->live_nodes()/16 to avoid frequent realloc >>> 4242: uint stack_size = live_nodes() >> 4; >>> 4243: Node_List nstack(MAX2(stack_size, (uint)OptoNodeListSize)); >> >> As you only need the stack in `verify_edges()`, I suggest to move these lines directly into the method `verify_edges()`. > > I need `live_nodes()` value or `stack_size` or `C` to pass for creating list inside method. > I decided to move renamed `verify_bidirectional_edges()` method to `Compile` class to get these values inside the method. > It does not need to be in `Node` class after I removed recursion. Right, I've missed that before. Moving it to `Compile` is a good idea to apply that! ------------- PR: https://git.openjdk.org/jdk/pull/11065 From chagedorn at openjdk.org Fri Nov 11 09:46:14 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Nov 2022 09:46:14 GMT Subject: RFR: 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes [v2] In-Reply-To: References: Message-ID: > There are currently two problems with `IRNode.ALLOC*` regexes: > 1. On PPC64, we do not account for an `LI` instruction which matches the array size. As a result, we could miss some array allocations with the `ALLOC_ARRAY*` regexes: > > 2e4 LD R3, offset, R3 // load ptr precise [java/lang/Object: > 0x0000200058006e40 *: :Constant:exact * from TOC (lo) > 2e8 STD R17, [R1_SP + #104+0] // spill copy > 2ec LI R4, #1 <------- we only look for LGHI here which is specific to s390 while LI is used for PPC64 > 2f0 CALL,static 0x00002000177cd300 // ==> wrapper for: _new_array_Java > > This was revealed by a new test added by [JDK-8280378](https://bugs.openjdk.org/browse/JDK-8280378) but was already a problem before this change. > > 2. The newly added `IRNode.ALLOC*` regexes in JDK-8280378 which can be matched on the independent ideal compile phases by using the name of the IR node "Allocate" also matches "AllocateArray" (substring match). This is unexpected. I've changed this by matching "Allocate" exactly. > > I've additionally removed the matching of `LI` and `LGHI` for the `ALLOC` regexes on normal objects as we do not have an array size. I think it's safe to remove these (might need some additional testing on PPC64/s390). > > Thanks @TheRealMDoerr for helping to test the initial fix on PPC64! 
> > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: new line ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11037/files - new: https://git.openjdk.org/jdk/pull/11037/files/c36e645b..0d28220f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11037&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11037&range=00-01 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11037.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11037/head:pull/11037 PR: https://git.openjdk.org/jdk/pull/11037 From chagedorn at openjdk.org Fri Nov 11 09:46:15 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Nov 2022 09:46:15 GMT Subject: RFR: 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 09:58:42 GMT, Christian Hagedorn wrote: > There are currently two problems with `IRNode.ALLOC*` regexes: > 1. On PPC64, we do not account for an `LI` instruction which matches the array size. As a result, we could miss some array allocations with the `ALLOC_ARRAY*` regexes: > > 2e4 LD R3, offset, R3 // load ptr precise [java/lang/Object: > 0x0000200058006e40 *: :Constant:exact * from TOC (lo) > 2e8 STD R17, [R1_SP + #104+0] // spill copy > 2ec LI R4, #1 <------- we only look for LGHI here which is specific to s390 while LI is used for PPC64 > 2f0 CALL,static 0x00002000177cd300 // ==> wrapper for: _new_array_Java > > This was revealed by a new test added by [JDK-8280378](https://bugs.openjdk.org/browse/JDK-8280378) but was already a problem before this change. > > 2. The newly added `IRNode.ALLOC*` regexes in JDK-8280378 which can be matched on the independent ideal compile phases by using the name of the IR node "Allocate" also matches "AllocateArray" (substring match). This is unexpected. I've changed this by matching "Allocate" exactly. > > I've additionally removed the matching of `LI` and `LGHI` for the `ALLOC` regexes on normal objects as we do not have an array size. I think it's safe to remove these (might need some additional testing on PPC64/s390). > > Thanks @TheRealMDoerr for helping to test the initial fix on PPC64! > > Thanks, > Christian Thanks Tobias for your review! I've added a new line as suggested. ------------- PR: https://git.openjdk.org/jdk/pull/11037 From thartmann at openjdk.org Fri Nov 11 09:54:33 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 09:54:33 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() [v2] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 15:13:48 GMT, Roman Kennke wrote: >> The code in C2_MacroAssembler::fast_unlock() has several (minor) issues: >> - The stack-locking path for x86_32 is not under UseHeavyMonitors - it would be executed even when stack-locking is disabled. >> - The stack-locking paths are the same for x86_32 and x86_64 - they can be merged into a common path. >> - In x86_32 path, we call get_thread(boxReg) which is totally bogus because we clear boxReg right afterwards with xorptr(boxReg, boxReg). >> - In x86_32 path, the CheckSucc label is identical to the DONE label, and in-fact CheckSucc is only ever really used in the x86_64 path and can be moved there. >> >> Testing: >> - [x] tier1 (x86_32, x86_64) >> - [x] tier2 (x86_32, x86_64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge remote-tracking branch 'upstream/master' into JDK-8296170 > - 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 928: > 926: } > 927: > 928: // TODO: Comment still valid? We have the same comment here: https://github.com/openjdk/jdk/blob/12e76cbc725ff87577e2ef23267590eae37a82d1/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L707 And it seems to be there from the beginning: https://hg.openjdk.java.net/jdk/jdk/annotate/74aaad871363/hotspot/src/cpu/x86/vm/x86_32.ad#l3441 I think instead of adding a `TODO`, we should either completely remove the comments or file a follow-up enhancement to investigate. ------------- PR: https://git.openjdk.org/jdk/pull/10936 From thartmann at openjdk.org Fri Nov 11 09:55:36 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 09:55:36 GMT Subject: RFR: 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes [v2] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 09:46:14 GMT, Christian Hagedorn wrote: >> There are currently two problems with `IRNode.ALLOC*` regexes: >> 1. On PPC64, we do not account for an `LI` instruction which matches the array size. As a result, we could miss some array allocations with the `ALLOC_ARRAY*` regexes: >> >> 2e4 LD R3, offset, R3 // load ptr precise [java/lang/Object: >> 0x0000200058006e40 *: :Constant:exact * from TOC (lo) >> 2e8 STD R17, [R1_SP + #104+0] // spill copy >> 2ec LI R4, #1 <------- we only look for LGHI here which is specific to s390 while LI is used for PPC64 >> 2f0 CALL,static 0x00002000177cd300 // ==> wrapper for: _new_array_Java >> >> This was revealed by a new test added by [JDK-8280378](https://bugs.openjdk.org/browse/JDK-8280378) but was already a problem before this change. >> >> 2. The newly added `IRNode.ALLOC*` regexes in JDK-8280378 which can be matched on the independent ideal compile phases by using the name of the IR node "Allocate" also matches "AllocateArray" (substring match). This is unexpected. I've changed this by matching "Allocate" exactly. >> >> I've additionally removed the matching of `LI` and `LGHI` for the `ALLOC` regexes on normal objects as we do not have an array size. I think it's safe to remove these (might need some additional testing on PPC64/s390). >> >> Thanks @TheRealMDoerr for helping to test the initial fix on PPC64! >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > new line Marked as reviewed by thartmann (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11037 From dsamersoff at openjdk.org Fri Nov 11 11:47:33 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Fri, 11 Nov 2022 11:47:33 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 02:37:50 GMT, Dean Long wrote: >> src/hotspot/cpu/x86/nativeInst_x86.cpp line 514: >> >>> 512: // complete jump instruction (to be inserted) is in code_buffer; >>> 513: #ifdef AMD64 >>> 514: unsigned char code_buffer[8]; >> >> Should we align this buffer too (to 8/jlong)? > > I suggest using a union. @dean-long Complete decomposition to union, like one below, looks really nice, but requires gcc-specific attribute, that I would like to avoid. 
union { jlong cb_long; struct { char code; int32_t disp; } __attribute__((packed)) instr; } u; ``` Do you prefer ``` union { jlong cb_long; unsigned char code_buffer[8]; } u; ``` over the cast that is similar to one in 32bit version? ------------- PR: https://git.openjdk.org/jdk/pull/11059 From dsamersoff at openjdk.org Fri Nov 11 12:13:10 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Fri, 11 Nov 2022 12:13:10 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: References: Message-ID: <8aQHbzt0g0EQyvuEvycMD8_puZ8Z_Zs65LKiV-jcWUE=.3cd9286f-2177-4506-81e7-28a98beacc46@github.com> On Wed, 9 Nov 2022 17:19:55 GMT, Vladimir Kozlov wrote: >> In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. >> >> The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. > > src/hotspot/cpu/x86/nativeInst_x86.cpp line 514: > >> 512: // complete jump instruction (to be inserted) is in code_buffer; >> 513: #ifdef AMD64 >> 514: unsigned char code_buffer[8]; > > Should we align this buffer too (to 8/jlong)? @vnkozlov CXX optimizes that code to a few of register operations, and optimize out local variable _code_buffer_, so we need not to care about its alignment. ------------- PR: https://git.openjdk.org/jdk/pull/11059 From bulasevich at openjdk.org Fri Nov 11 12:33:17 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 11 Nov 2022 12:33:17 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v7] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision: - adding jtreg test for CompressedSparseDataReadStream impl - align java impl to cpp impl - rewrite the SparseDataWriteStream not to use _curr_byte - introduce and call flush() excplicitly, add the gtest - minor renaming. adding encoding examples table - cleanup and rename - cleanup - rewrite code without virtual functions - warning fix and name fix - optimize the encoding - ... 
and 2 more: https://git.openjdk.org/jdk/compare/89c16176...637c94be ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/f365d780..637c94be Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=05-06 Stats: 198377 lines in 2647 files changed: 114571 ins; 45710 del; 38096 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From tholenstein at openjdk.org Fri Nov 11 13:53:26 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 11 Nov 2022 13:53:26 GMT Subject: RFR: JDK-8296665: IGV: Show dialog with stack trace for exceptions In-Reply-To: References: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> Message-ID: On Fri, 11 Nov 2022 09:22:45 GMT, Tobias Hartmann wrote: > Looks good to me. If we start hitting too many exceptions, we might need to (temporarily) disable this again. > > Just wondering, isn't there a config file or something to put such Netbeans specific settings instead of passing them as `-J-D` args to the JVM? Unfortunately, the only config file i know is `/etc/idealgraphvisualizer.conf` (`jdk/open/src/utils/IdealGraphVisualizer/application/target/idealgraphvisualizer/etc/idealgraphvisualizer.conf`). This file can also be modified after compiling IGV and changes the settings without recompiling. Another way is to set option in code like the following System.setProperty("netbeans.exception.report.min.level", "1000"); ------------- PR: https://git.openjdk.org/jdk/pull/11060 From rkennke at openjdk.org Fri Nov 11 14:56:05 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 11 Nov 2022 14:56:05 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() [v3] In-Reply-To: References: Message-ID: > The code in C2_MacroAssembler::fast_unlock() has several (minor) issues: > - The stack-locking path for x86_32 is not under UseHeavyMonitors - it would be executed even when stack-locking is disabled. > - The stack-locking paths are the same for x86_32 and x86_64 - they can be merged into a common path. > - In x86_32 path, we call get_thread(boxReg) which is totally bogus because we clear boxReg right afterwards with xorptr(boxReg, boxReg). > - In x86_32 path, the CheckSucc label is identical to the DONE label, and in-fact CheckSucc is only ever really used in the x86_64 path and can be moved there. 
> > Testing: > - [x] tier1 (x86_32, x86_64) > - [x] tier2 (x86_32, x86_64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Remove comments about DONE_LABEL being a hot target ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10936/files - new: https://git.openjdk.org/jdk/pull/10936/files/bc4fd918..153353ac Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10936&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10936&range=01-02 Stats: 12 lines in 1 file changed: 0 ins; 12 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10936.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10936/head:pull/10936 PR: https://git.openjdk.org/jdk/pull/10936 From rkennke at openjdk.org Fri Nov 11 14:56:09 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 11 Nov 2022 14:56:09 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() [v2] In-Reply-To: References: Message-ID: <9sH191K8M6sFLR314aZxJs1iXu1h6PdT1viQdWY3-d4=.ada1c4b7-4f64-47af-8038-055e24ca419f@github.com> On Fri, 11 Nov 2022 09:50:51 GMT, Tobias Hartmann wrote: >> Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge remote-tracking branch 'upstream/master' into JDK-8296170 >> - 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 928: > >> 926: } >> 927: >> 928: // TODO: Comment still valid? > > We have the same comment here: > https://github.com/openjdk/jdk/blob/12e76cbc725ff87577e2ef23267590eae37a82d1/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L707 > > And it seems to be there from the beginning: > https://hg.openjdk.java.net/jdk/jdk/annotate/74aaad871363/hotspot/src/cpu/x86/vm/x86_32.ad#l3441 > > I think instead of adding a `TODO`, we should either completely remove the comments or file a follow-up enhancement to investigate. I decided to remove the TODO. If we were to optimize this, then there are probably lower-hanging fruits like looking at the subsequent branch after the DONE_LABEL where we jump around the counting code. ------------- PR: https://git.openjdk.org/jdk/pull/10936 From thartmann at openjdk.org Fri Nov 11 15:18:28 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Nov 2022 15:18:28 GMT Subject: RFR: JDK-8296665: IGV: Show dialog with stack trace for exceptions In-Reply-To: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> References: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> Message-ID: On Wed, 9 Nov 2022 13:10:54 GMT, Tobias Holenstein wrote: > Currently in IGV when an exception occurs a small red icon in the bottom right corner appears. The user often does not see this and if he sees it, usually not immediately when the error occurs: > exception now > > The exception reporting level is changed to `1000` (Level.SEVERE) in IGV to show a dialog with the stack-trace. The user can still close it and continue the work: > exception suggestion > > To test insert something like the following somewhere in the codebase > > try { > int i=1/0; > } catch (Exception e) { > throw new RuntimeException(e); > } Okay, thanks for the details. It's probably best to leave the patch as is then. 
------------- PR: https://git.openjdk.org/jdk/pull/11060 From chagedorn at openjdk.org Fri Nov 11 15:33:33 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Nov 2022 15:33:33 GMT Subject: RFR: JDK-8296665: IGV: Show dialog with stack trace for exceptions In-Reply-To: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> References: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> Message-ID: On Wed, 9 Nov 2022 13:10:54 GMT, Tobias Holenstein wrote: > Currently in IGV when an exception occurs a small red icon in the bottom right corner appears. The user often does not see this and if he sees it, usually not immediately when the error occurs: > exception now > > The exception reporting level is changed to `1000` (Level.SEVERE) in IGV to show a dialog with the stack-trace. The user can still close it and continue the work: > exception suggestion > > To test insert something like the following somewhere in the codebase > > try { > int i=1/0; > } catch (Exception e) { > throw new RuntimeException(e); > } Makes sense to add this back now that IGV is more stable - looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11060 From kvn at openjdk.org Fri Nov 11 16:13:02 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 11 Nov 2022 16:13:02 GMT Subject: Integrated: 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline In-Reply-To: References: Message-ID: <8Ecv8CWSqhORMcHw8FLJkWzCTJI6nSBuLm7P3Eb-7BE=.3365af3c-1557-484e-b1f6-dbc1e457858d@github.com> On Wed, 9 Nov 2022 18:14:43 GMT, Vladimir Kozlov wrote: > When we use -Xcomp we compile `java.lang.invoke.LambdaForm$Kind::` very long linear method for enum class: [LambdaForm.java#L250](https://github.com/openjdk/jdk/blame/master/src/java.base/share/classes/java/lang/invoke/LambdaForm.java#L250) > > In addition we inline all class initializers for EA when run with -Xcomp: [bytecodeInfo.cpp#L410](https://github.com/openjdk/jdk/blame/master/src/hotspot/share/opto/bytecodeInfo.cpp#L410) > > Recent CDS change [JDK-8293979](https://bugs.openjdk.org/browse/JDK-8293979) allows to inline a bit more deeply too. > > Adding `-XX:+AlwaysIncrementalInline` worsened situation even more. > > Running with `-XX:+LogCompilation` shows that we hit `NodeCountInliningCutoff (18000)` during `java.lang.invoke.LambdaForm$Kind::` compilation. > > In short, we have very long (>40000 live nodes) linear IR graph. `Node::verify_edges()` method process nodes depth-first starting from first input which is control edge. So it is not surprise that depth of this method recursion reached 6000. > With frame size of 10 words (320 bytes) we easy hit stack overflow (768K in 32-bits debug VM). > > I fixed it by using local buffer `Node_List` instead of recursion in `Node::verify_edges()`. > The algorithm was changed to simplify code. It processes inputs in reverse order - last input processed first. And I noticed that maximum use of buffer is only about 1000 or less elements for this compilation (that is why I use live_nodes/16 as initial size of buffer). 
> > Then I did additional experiment with keeping recursion but processing inputs in reverse order: > > > // Recursive walk over all input edges > - for( i = 0; i < len(); i++ ) { > - n = in(i); > + for( i = len(); i > 0; i++ ) { > + n = in(i - 1); > if( n != NULL ) > in(i)->verify_edges(visited); > } > > > And it shows the same around 1000 stack depth! > > I decided to keep my original fix because it should be faster (put only one value on list instead of putting all locals, PC, SP on stack and calls) and much less stack usage. > > Testing tier1-3, hs-comp-stress and `TestVerifyGraphEdges.java` test runs with `-XX:+AlwaysIncrementalInline`. This pull request has now been integrated. Changeset: 819c6919 Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/819c6919ca3067ec475b5b268f54e10700eec039 Stats: 106 lines in 4 files changed: 58 ins; 46 del; 2 mod 8295867: TestVerifyGraphEdges.java fails with exit code -1073741571 when using AlwaysIncrementalInline Reviewed-by: chagedorn, shade ------------- PR: https://git.openjdk.org/jdk/pull/11065 From chagedorn at openjdk.org Fri Nov 11 17:33:02 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Nov 2022 17:33:02 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong [v3] In-Reply-To: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: > We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): > > assert(real_LCA != NULL, "must always find an LCA" > ``` > The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: > - I limited the node dump to idx + node name to reduce the noise which made it hard to read. > - Reversed the idom chain dumps to reflect the graph structure. > > Example output: > > Bad graph detected in build_loop_late > n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) > > [... same output as before ...] > > idoms of early "197 IfFalse": > idom[2]: 42 If > idom[1]: 44 IfTrue > idom[0]: 196 If > n: 197 IfFalse > > idoms of (wrong) LCA "205 IfTrue": > idom[4]: 42 If > idom[3]: 37 Region > idom[2]: 73 If > idom[1]: 83 IfTrue > idom[0]: 204 If > n: 205 IfTrue > > Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): > 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) > > Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. 
> > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Change algorithm as suggested by Roberto ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11015/files - new: https://git.openjdk.org/jdk/pull/11015/files/342213d6..0e1954ba Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11015&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11015&range=01-02 Stats: 40 lines in 1 file changed: 12 ins; 9 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/11015.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11015/head:pull/11015 PR: https://git.openjdk.org/jdk/pull/11015 From chagedorn at openjdk.org Fri Nov 11 17:33:05 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Nov 2022 17:33:05 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong [v2] In-Reply-To: References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: On Mon, 7 Nov 2022 11:47:51 GMT, Christian Hagedorn wrote: >> We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): >> >> assert(real_LCA != NULL, "must always find an LCA" >> ``` >> The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: >> - I limited the node dump to idx + node name to reduce the noise which made it hard to read. >> - Reversed the idom chain dumps to reflect the graph structure. >> >> Example output: >> >> Bad graph detected in build_loop_late >> n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) >> >> [... same output as before ...] >> >> idoms of early "197 IfFalse": >> idom[2]: 42 If >> idom[1]: 44 IfTrue >> idom[0]: 196 If >> n: 197 IfFalse >> >> idoms of (wrong) LCA "205 IfTrue": >> idom[4]: 42 If >> idom[3]: 37 Region >> idom[2]: 73 If >> idom[1]: 83 IfTrue >> idom[0]: 204 If >> n: 205 IfTrue >> >> Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): >> 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) >> >> Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix optimized build Thank you Roberto for the suggestion! That is indeed cleaner. I've pushed an update accordingly. To make it easier, I've pushed `_early` and `_wrong_lca` to the node lists of `find_real_lca()`. ------------- PR: https://git.openjdk.org/jdk/pull/11015 From duke at openjdk.org Fri Nov 11 17:56:55 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 11 Nov 2022 17:56:55 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v15] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. 
I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: Vladimir's review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/835fbe3a..2a225e42 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=13-14 Stats: 23 lines in 2 files changed: 0 ins; 2 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 11 18:12:20 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 11 Nov 2022 18:12:20 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 01:26:40 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> live review with Sandhya > > src/hotspot/cpu/x86/macroAssembler_x86.hpp line 733: > >> 731: void andptr(Register src1, Register src2) { LP64_ONLY(andq(src1, src2)) NOT_LP64(andl(src1, src2)) ; } >> 732: >> 733: #ifdef _LP64 > > Why is it x64-specific? I believe its needed. TLDR.. Couple of check ins ago, I broke the 32-bit build, and that was the 'easy' fix.. > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 161: > >> 159: const XMMRegister P2_H = xmm5; >> 160: const XMMRegister TMP1 = xmm6; >> 161: const Register polyCP = r13; > > Could be renamed to `rscratch` (or `tmp`) since it doesn't hold constant base address anymore. 
done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 11 18:12:23 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 11 Nov 2022 18:12:23 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 01:25:07 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> jcheck > > src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 252: > >> 250: private void processMultipleBlocks(byte[] input, int offset, int length, long[] aLimbs, long[] rLimbs) { >> 251: while (length >= BLOCK_LENGTH) { >> 252: n.setValue(input, offset, BLOCK_LENGTH, (byte)0x01); > > You could call `processBlock(input, offset, BLOCK_LENGTH);` here. done (duh.. thanks, neater code) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Fri Nov 11 20:01:34 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 20:01:34 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 18:08:50 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/macroAssembler_x86.hpp line 733: >> >>> 731: void andptr(Register src1, Register src2) { LP64_ONLY(andq(src1, src2)) NOT_LP64(andl(src1, src2)) ; } >>> 732: >>> 733: #ifdef _LP64 >> >> Why is it x64-specific? > > I believe its needed. > > TLDR.. Couple of check ins ago, I broke the 32-bit build, and that was the 'easy' fix.. Right, `addq` instructions are x64-specific. I was confused because `assembler_x86.hpp` doesn't declare them as such which is a bug. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 11 20:10:33 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 11 Nov 2022 20:10:33 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 19:56:40 GMT, Vladimir Ivanov wrote: >> I believe its needed. >> >> TLDR.. Couple of check ins ago, I broke the 32-bit build, and that was the 'easy' fix.. > > Right, `addq` instructions are x64-specific. I was confused because `assembler_x86.hpp` doesn't declare them as such which is a bug. I am mystified at how it actually gets removed from the `assembler_x86.o` object on 32-bit.. The only reliable/portable way _would_ be with `#ifdef` but its not there.. so.. code-generation? `sed`-like preprocessing? Can one edit object files after the gcc ran? The build must be doing something clever!! Haven't seen it yet.. Whatever the trick is, `assembler_x86.hpp` gets it, but not `macroAssembler_x86.hpp`. If it doesn't ring any bells, maybe I will spend some more time looking at the traces, maybe can figure out what the build script is doing to remove the symbol. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Fri Nov 11 20:38:33 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 11 Nov 2022 20:38:33 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 20:08:27 GMT, Volodymyr Paprotski wrote: >> Right, `addq` instructions are x64-specific. I was confused because `assembler_x86.hpp` doesn't declare them as such which is a bug. 
> > I am mystified at how it actually gets removed from the `assembler_x86.o` object on 32-bit.. The only reliable/portable way _would_ be with `#ifdef` but its not there.. so.. code-generation? `sed`-like preprocessing? Can one edit object files after the gcc ran? The build must be doing something clever!! Haven't seen it yet.. > > Whatever the trick is, `assembler_x86.hpp` gets it, but not `macroAssembler_x86.hpp`. > > If it doesn't ring any bells, maybe I will spend some more time looking at the traces, maybe can figure out what the build script is doing to remove the symbol. It's not specific to `andq`: there's a huge `#ifdef` block around the definitions in `assembler_x86.hpp` (lines 12201 - 13773; and there's even a nested `#ifdef _LP64` (lines 13515-13585)!) , but declarations aren't guarded by `#ifdef _LP64`. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Fri Nov 11 20:49:40 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 11 Nov 2022 20:49:40 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 20:34:34 GMT, Vladimir Ivanov wrote: >> I am mystified at how it actually gets removed from the `assembler_x86.o` object on 32-bit.. The only reliable/portable way _would_ be with `#ifdef` but its not there.. so.. code-generation? `sed`-like preprocessing? Can one edit object files after the gcc ran? The build must be doing something clever!! Haven't seen it yet.. >> >> Whatever the trick is, `assembler_x86.hpp` gets it, but not `macroAssembler_x86.hpp`. >> >> If it doesn't ring any bells, maybe I will spend some more time looking at the traces, maybe can figure out what the build script is doing to remove the symbol. > > It's not specific to `andq`: there's a huge `#ifdef` block around the definitions in `assembler_x86.hpp` (lines 12201 - 13773; and there's even a nested `#ifdef _LP64` (lines 13515-13585)!) , but declarations aren't guarded by `#ifdef _LP64`. Yeah, just got to about the same conclusion by looking at the preprocessor `-E` output.. its declared in the header, but not defined in the 'cpp' file.. One would think that that's a compile error, but its been more then a decade since I looked at the C++ spec; 'C++ compiler is always right'. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From dlong at openjdk.org Fri Nov 11 22:42:01 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 11 Nov 2022 22:42:01 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 12:41:59 GMT, Dmitry Samersoff wrote: > In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. > > The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. src/hotspot/cpu/x86/nativeInst_x86.cpp line 511: > 509: // In JVMCI, the restriction is enforced by HotSpotFrameContext.enter(...) > 510: // > 511: void NativeJump::patch_verified_entry(address entry, address verified_entry, address dest) { Should we assert that the appropriate lock is held here? 
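For anyone else puzzled by why this compiles at all, here is a minimal standalone sketch of the C++ rule at play (hypothetical class and method names, not the real Assembler sources): a member function may be declared without ever being defined, and the build only breaks — with an undefined-symbol link error — if some translation unit actually calls it.

```
// Hypothetical class, not the real Assembler: the point is only that an
// undefined member function is fine until something actually calls it.
class Asm {
 public:
  void addl(int v);   // declared and defined for every target
  void addq(long v);  // declared unconditionally, defined only for 64-bit
};

void Asm::addl(int v) { (void)v; }

#ifdef _LP64
// Mirrors the big "#ifdef _LP64" block guarding the definitions: on a
// 32-bit build this definition simply does not exist.
void Asm::addq(long v) { (void)v; }
#endif

int main() {
  Asm a;
  a.addl(1);
  // a.addq(1L);  // uncommenting this on a 32-bit build gives an
  //              // "undefined reference" at link time, not a compile error.
  return 0;
}
```

So nothing is "removed" from the object file: the 64-bit-only definitions are never emitted in a 32-bit build, the unguarded declarations are harmless, and the linker only complains if 32-bit code tries to call one of them.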
------------- PR: https://git.openjdk.org/jdk/pull/11059 From dlong at openjdk.org Fri Nov 11 22:42:03 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 11 Nov 2022 22:42:03 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: <8aQHbzt0g0EQyvuEvycMD8_puZ8Z_Zs65LKiV-jcWUE=.3cd9286f-2177-4506-81e7-28a98beacc46@github.com> References: <8aQHbzt0g0EQyvuEvycMD8_puZ8Z_Zs65LKiV-jcWUE=.3cd9286f-2177-4506-81e7-28a98beacc46@github.com> Message-ID: On Fri, 11 Nov 2022 12:09:10 GMT, Dmitry Samersoff wrote: >> src/hotspot/cpu/x86/nativeInst_x86.cpp line 514: >> >>> 512: // complete jump instruction (to be inserted) is in code_buffer; >>> 513: #ifdef AMD64 >>> 514: unsigned char code_buffer[8]; >> >> Should we align this buffer too (to 8/jlong)? > > @vnkozlov > > CXX optimizes that code to a few of register operations, and optimize out local variable _code_buffer_, so we need not to care about its alignment. > ``` > union { > jlong cb_long; > unsigned char code_buffer[8]; > } u; > ``` > This version is fine. It removes any question about the alignment of cb_long at the same time. ------------- PR: https://git.openjdk.org/jdk/pull/11059 From kvn at openjdk.org Fri Nov 11 23:50:26 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 11 Nov 2022 23:50:26 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: References: <8aQHbzt0g0EQyvuEvycMD8_puZ8Z_Zs65LKiV-jcWUE=.3cd9286f-2177-4506-81e7-28a98beacc46@github.com> Message-ID: On Fri, 11 Nov 2022 22:38:10 GMT, Dean Long wrote: > This version is fine. It removes any question about the alignment of cb_long at the same time. I agree. ------------- PR: https://git.openjdk.org/jdk/pull/11059 From vlivanov at openjdk.org Sat Nov 12 00:42:40 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 12 Nov 2022 00:42:40 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 12:41:59 GMT, Dmitry Samersoff wrote: > In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. > > The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. src/hotspot/cpu/x86/nativeInst_x86.cpp line 513: > 511: void NativeJump::patch_verified_entry(address entry, address verified_entry, address dest) { > 512: // complete jump instruction (to be inserted) is in code_buffer; > 513: #ifdef AMD64 Just a minor suggestion: `_LP64` is more appropriate here since it's x86-specific file. ------------- PR: https://git.openjdk.org/jdk/pull/11059 From omikhaltcova at openjdk.org Sat Nov 12 23:39:03 2022 From: omikhaltcova at openjdk.org (Olga Mikhaltsova) Date: Sat, 12 Nov 2022 23:39:03 GMT Subject: RFR: 8296821: compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java fails after JDK-8262901 Message-ID: This is a fix for a bug in the test that was added by [JDK-8262901](https://bugs.openjdk.org/browse/JDK-8262901). The root cause of the issue was because the test returns 'int' while 'float' was expected. In addition the stack growth considering 16 bytes alignment in AMD64TestAssembler is fixed (similar to AArch64TestAssembler). 
Tested on macOS x64 / AArch64, Linux x64 / AArch64 as follow: ` $JTREG_HOME/bin/jtreg -ea -jdk:$BUILD_HOME -nativepath:$NATIVE_PATH ./test/hotspot/jtreg/compiler/jvmci/` ------------- Commit messages: - 8296821: compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java fails after JDK-8262901 Changes: https://git.openjdk.org/jdk/pull/11114/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11114&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296821 Stats: 12 lines in 3 files changed: 6 ins; 3 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11114.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11114/head:pull/11114 PR: https://git.openjdk.org/jdk/pull/11114 From omikhaltcova at openjdk.org Sat Nov 12 23:55:37 2022 From: omikhaltcova at openjdk.org (Olga Mikhaltsova) Date: Sat, 12 Nov 2022 23:55:37 GMT Subject: RFR: 8296821: compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java fails after JDK-8262901 In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 22:34:59 GMT, Olga Mikhaltsova wrote: > This is a fix for a bug in the test that was added by [JDK-8262901](https://bugs.openjdk.org/browse/JDK-8262901). > The root cause of the issue was because the test returns 'int' while 'float' was expected. > > In addition the stack growth considering 16 bytes alignment in AMD64TestAssembler is fixed (similar to AArch64TestAssembler). > > Tested on macOS x64 / AArch64, Linux x64 / AArch64 as follow: > ` $JTREG_HOME/bin/jtreg -ea -jdk:$BUILD_HOME -nativepath:$NATIVE_PATH ./test/hotspot/jtreg/compiler/jvmci/` 1 test failed on Linux x86: `compiler/c2/TestVerifyGraphEdges.java` Seems it's already been fixed by [JDK-8295867](https://bugs.openjdk.org/browse/JDK-8295867) ([JDK-8295936](https://bugs.openjdk.org/browse/JDK-8295936)). ------------- PR: https://git.openjdk.org/jdk/pull/11114 From kvn at openjdk.org Sun Nov 13 02:35:27 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 13 Nov 2022 02:35:27 GMT Subject: RFR: 8296821: compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java fails after JDK-8262901 In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 22:34:59 GMT, Olga Mikhaltsova wrote: > This is a fix for a bug in the test that was added by [JDK-8262901](https://bugs.openjdk.org/browse/JDK-8262901). > The root cause of the issue was because the test returns 'int' while 'float' was expected. > > In addition the stack growth considering 16 bytes alignment in AMD64TestAssembler is fixed (similar to AArch64TestAssembler). > > Tested on macOS x64 / AArch64, Linux x64 / AArch64 as follow: > ` $JTREG_HOME/bin/jtreg -ea -jdk:$BUILD_HOME -nativepath:$NATIVE_PATH ./test/hotspot/jtreg/compiler/jvmci/` Looks reasonable. I submitted testing. ------------- PR: https://git.openjdk.org/jdk/pull/11114 From kvn at openjdk.org Sun Nov 13 15:23:26 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 13 Nov 2022 15:23:26 GMT Subject: RFR: 8296821: compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java fails after JDK-8262901 In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 22:34:59 GMT, Olga Mikhaltsova wrote: > This is a fix for a bug in the test that was added by [JDK-8262901](https://bugs.openjdk.org/browse/JDK-8262901). > The root cause of the issue was because the test returns 'int' while 'float' was expected. > > In addition the stack growth considering 16 bytes alignment in AMD64TestAssembler is fixed (similar to AArch64TestAssembler). 
> > Tested on macOS x64 / AArch64, Linux x64 / AArch64 as follow: > ` $JTREG_HOME/bin/jtreg -ea -jdk:$BUILD_HOME -nativepath:$NATIVE_PATH ./test/hotspot/jtreg/compiler/jvmci/` My tier1-3 testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11114 From fgao at openjdk.org Mon Nov 14 01:39:32 2022 From: fgao at openjdk.org (Fei Gao) Date: Mon, 14 Nov 2022 01:39:32 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 08:38:34 GMT, Tobias Hartmann wrote: >> For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. >> >> Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. >> >> We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether >> `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. >> >> [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 >> [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > > test/hotspot/jtreg/compiler/loopopts/TestUnsupportedConditionalMove.java line 28: > >> 26: * @bug 8295407 >> 27: * @summary C2 crash: Error: ShouldNotReachHere() in multiple vector tests with >> 28: * -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast > > `MonomorphicArrayCheck` and `UncommonNullCast` are debug flags, the test will fail with a release build. @TobiHartmann, thanks for your review! The options added here is commented as part of summary title, not as JVM options. I suppose it should be fine for a release build, right? 
:-) ------------- PR: https://git.openjdk.org/jdk/pull/11034 From dongbo at openjdk.org Mon Nov 14 02:14:17 2022 From: dongbo at openjdk.org (Dong Bo) Date: Mon, 14 Nov 2022 02:14:17 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 03:06:21 GMT, Dong Bo wrote: > In JDK-8252204, when implemented SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`. > However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`: > > digest_length block_size > SHA3-224 28 144 > SHA3-256 32 136 > SHA3-384 48 104 > SHA3-512 64 72 > SHAKE128 variable 168 > SHAKE256 variable 136 > > > This causes SIGSEGV crash or hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`. > > The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances. > Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed. > And tier1~3 passed on SHA3 supported hardware. > > The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform. > The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively. > > Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform. > Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression. > These performance details are available in the comments of the issue page. > I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform. Hi, @nick-arm @theRealAph, could you help to review this? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10939 From thartmann at openjdk.org Mon Nov 14 06:07:30 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Nov 2022 06:07:30 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 01:37:29 GMT, Fei Gao wrote: >> test/hotspot/jtreg/compiler/loopopts/TestUnsupportedConditionalMove.java line 28: >> >>> 26: * @bug 8295407 >>> 27: * @summary C2 crash: Error: ShouldNotReachHere() in multiple vector tests with >>> 28: * -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast >> >> `MonomorphicArrayCheck` and `UncommonNullCast` are debug flags, the test will fail with a release build. > > @TobiHartmann, thanks for your review! The options added here is commented as part of summary title, not as JVM options. I suppose it should be fine for a release build, right? :-) Right, I missed that. Does the test reproduce the issue without these flags? In any case, I think a more descriptive summary would be good. 
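Circling back to the SHA3 block sizes tabulated in Dong Bo's message above, a small standalone sketch (illustrative helper name, not JDK code) of the relationship driving that fix: the fixed-output SHA3 instances let the block size be derived from the digest length, but the SHAKE rates from the table (168 and 136 bytes) cannot be recovered that way, which is why the patch passes block_size instead.

```
#include <cstdio>

// Rate ("block size") of the Keccak sponge in bytes. For the fixed-output
// SHA3-* instances it follows from the digest length; SHAKE128 and SHAKE256
// produce variable-length output, so their rate must be supplied explicitly.
static int sha3_block_size(int digest_length) {
  return 200 - 2 * digest_length;   // valid for SHA3-224/256/384/512 only
}

int main() {
  std::printf("SHA3-224: %d\n", sha3_block_size(28));  // 144
  std::printf("SHA3-256: %d\n", sha3_block_size(32));  // 136
  std::printf("SHA3-384: %d\n", sha3_block_size(48));  // 104
  std::printf("SHA3-512: %d\n", sha3_block_size(64));  // 72
  // SHAKE128 (rate 168) and SHAKE256 (rate 136) have no fixed digest_length,
  // so this formula cannot identify them.
  return 0;
}
```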
------------- PR: https://git.openjdk.org/jdk/pull/11034 From roland at openjdk.org Mon Nov 14 08:05:47 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 14 Nov 2022 08:05:47 GMT Subject: RFR: 8276064: CheckCastPP with raw oop input floats below a safepoint In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 11:55:04 GMT, Tobias Hartmann wrote: > This bug is similar to [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600): A CheckCastPP with a raw oop input floats out of a loop and below a safepoint. Since C2 does not generate OopMap entries for raw pointers, the GC will not update the oop if the corresponding object is moved during the safepoint. We either assert already during OopMap creation, or crash when dereferencing a stale oop during runtime (the verification code does not always detect such live raw oops at safepoints, I included a fix for that as well). > > I think the fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) is incomplete, because it only bails out of [PhaseIdealLoop::try_sink_out_of_loop](https://github.com/openjdk/jdk/commit/2ff4c01d42f1afcc53abd48e074356fb4a700754) while the underlying issue is that a raw CheckCastPP ends up with ctrl "far away" from its Allocate/Initialize and potentially even below a safepoint. Usually, the CheckCastPP would always be part of safepoint debug info and therefore late ctrl would be guaranteed to be above the safepoint. However, vector objects are aggressively scalar replaced in safepoints, which allows late ctrl to be set to further below. This is specific to vectors, since "normal" Java objects would either be fully scalarized or not be scalarized at all. > > In the failing case, Loop Unswitching clones the loop body and creates a Phi to merge the oop results from the vector allocations in both loops. Since ctrl of the CheckCastPP is outside of the loop, its data input is changed to the newly created Phi and its control input is set to the region that merges the loop exits. This moves the CheckCastPP below a safepoint in the loop. > > Below graphs show the details. `395 CheckCastPP` is removed from the debug info for `262 CallStaticJava` because it's scalarized (`326 SafepointScalarObject`). Late ctrl is then computed to be outside of the loop because the CheckCastPP is only used in the return. > > ![8276064_Before](https://user-images.githubusercontent.com/5312595/199204249-17564a59-2b67-4426-be71-19bc0eafac99.png) > > Now Loop Unswitching creates a `487 Region` and `517 Phi` to merge control and data inputs to the CheckCastPP from the fast and slow loops (see `PhaseIdealLoop::clone_loop_handle_data_uses`). Control of the `395 CheckCastPP` is updated accordingly. > > ![8276064_After](https://user-images.githubusercontent.com/5312595/199204273-44341cd7-b5b6-4ec0-b8c9-6f349393dbd1.png) > > As a result, the raw oop input of the `395 CheckCastPP` is live at `262 CallStaticJava`. > > We could now add another point fix to prevent loop unswitching from moving the CheckCastPP out of the loop, but I think there is a risk that other current or future optimizations would rely on the CheckCastPP's late ctrl and do a similar thing. 
I would therefore suggest to pin all CheckCastPPs with a raw oop input, similar to what [JDK-5071820](https://bugs.openjdk.org/browse/JDK-5071820) did in GCM: > https://github.com/openjdk/jdk/blob/37107fc1574a4191987420d88f7182e63c7da60c/src/hotspot/share/opto/gcm.cpp#L1325-L1330 > > The fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) can then be reverted, also because Roland's fix for [JDK-8272562](https://bugs.openjdk.org/browse/JDK-8272562) disabled moving **all** CheckCastPPs out of loops anyway. Roland said that he plans to revisit that decision with [JDK-8275202](https://bugs.openjdk.org/browse/JDK-8275202). The tests added with this PR will cover the `PhaseIdealLoop::try_sink_out_of_loop` case as well and therefore serve as regression tests for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600). > > We could improve this by adding logic to set late ctrl just above the safepoint, but I'm not sure if it's worth the complexity because we would need to walk up the control paths from late to early control and compute the dominator of all safepoints. > > I also fixed the verification code in `OopFlow::build_oop_map` to account for spilling. Before, compilation of `test1` would pass and only crash during execution. Now, we assert and print: > > > 454 DefinitionSpillCopy === _ 122 [[ 321 ]] !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 321 Phi === 315 454 503 [[ 512 ]] #rawptr:NotNull !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 38 CallStaticJavaDirect === 40 126 136 102 0 455 452 138 139 453 460 [[ 39 84 130 37 388 ]] Static compiler.vectorapi.TestRawOopAtSafepoint::safepoint # void ( int ) TestRawOopAtSafepoint::test1 @ bci:44 (line 75) !jvms: TestRawOopAtSafepoint::test1 @ bci:44 (line 75) > > > > What do you think? > > Thanks, > Tobias Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.org/jdk/pull/10932 From thartmann at openjdk.org Mon Nov 14 08:11:37 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Nov 2022 08:11:37 GMT Subject: RFR: 8276064: CheckCastPP with raw oop input floats below a safepoint In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 11:55:04 GMT, Tobias Hartmann wrote: > This bug is similar to [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600): A CheckCastPP with a raw oop input floats out of a loop and below a safepoint. Since C2 does not generate OopMap entries for raw pointers, the GC will not update the oop if the corresponding object is moved during the safepoint. We either assert already during OopMap creation, or crash when dereferencing a stale oop during runtime (the verification code does not always detect such live raw oops at safepoints, I included a fix for that as well). > > I think the fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) is incomplete, because it only bails out of [PhaseIdealLoop::try_sink_out_of_loop](https://github.com/openjdk/jdk/commit/2ff4c01d42f1afcc53abd48e074356fb4a700754) while the underlying issue is that a raw CheckCastPP ends up with ctrl "far away" from its Allocate/Initialize and potentially even below a safepoint. Usually, the CheckCastPP would always be part of safepoint debug info and therefore late ctrl would be guaranteed to be above the safepoint. However, vector objects are aggressively scalar replaced in safepoints, which allows late ctrl to be set to further below. 
This is specific to vectors, since "normal" Java objects would either be fully scalarized or not be scalarized at all. > > In the failing case, Loop Unswitching clones the loop body and creates a Phi to merge the oop results from the vector allocations in both loops. Since ctrl of the CheckCastPP is outside of the loop, its data input is changed to the newly created Phi and its control input is set to the region that merges the loop exits. This moves the CheckCastPP below a safepoint in the loop. > > Below graphs show the details. `395 CheckCastPP` is removed from the debug info for `262 CallStaticJava` because it's scalarized (`326 SafepointScalarObject`). Late ctrl is then computed to be outside of the loop because the CheckCastPP is only used in the return. > > ![8276064_Before](https://user-images.githubusercontent.com/5312595/199204249-17564a59-2b67-4426-be71-19bc0eafac99.png) > > Now Loop Unswitching creates a `487 Region` and `517 Phi` to merge control and data inputs to the CheckCastPP from the fast and slow loops (see `PhaseIdealLoop::clone_loop_handle_data_uses`). Control of the `395 CheckCastPP` is updated accordingly. > > ![8276064_After](https://user-images.githubusercontent.com/5312595/199204273-44341cd7-b5b6-4ec0-b8c9-6f349393dbd1.png) > > As a result, the raw oop input of the `395 CheckCastPP` is live at `262 CallStaticJava`. > > We could now add another point fix to prevent loop unswitching from moving the CheckCastPP out of the loop, but I think there is a risk that other current or future optimizations would rely on the CheckCastPP's late ctrl and do a similar thing. I would therefore suggest to pin all CheckCastPPs with a raw oop input, similar to what [JDK-5071820](https://bugs.openjdk.org/browse/JDK-5071820) did in GCM: > https://github.com/openjdk/jdk/blob/37107fc1574a4191987420d88f7182e63c7da60c/src/hotspot/share/opto/gcm.cpp#L1325-L1330 > > The fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) can then be reverted, also because Roland's fix for [JDK-8272562](https://bugs.openjdk.org/browse/JDK-8272562) disabled moving **all** CheckCastPPs out of loops anyway. Roland said that he plans to revisit that decision with [JDK-8275202](https://bugs.openjdk.org/browse/JDK-8275202). The tests added with this PR will cover the `PhaseIdealLoop::try_sink_out_of_loop` case as well and therefore serve as regression tests for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600). > > We could improve this by adding logic to set late ctrl just above the safepoint, but I'm not sure if it's worth the complexity because we would need to walk up the control paths from late to early control and compute the dominator of all safepoints. > > I also fixed the verification code in `OopFlow::build_oop_map` to account for spilling. Before, compilation of `test1` would pass and only crash during execution. Now, we assert and print: > > > 454 DefinitionSpillCopy === _ 122 [[ 321 ]] !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 321 Phi === 315 454 503 [[ 512 ]] #rawptr:NotNull !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 38 CallStaticJavaDirect === 40 126 136 102 0 455 452 138 139 453 460 [[ 39 84 130 37 388 ]] Static compiler.vectorapi.TestRawOopAtSafepoint::safepoint # void ( int ) TestRawOopAtSafepoint::test1 @ bci:44 (line 75) !jvms: TestRawOopAtSafepoint::test1 @ bci:44 (line 75) > > > > What do you think? > > Thanks, > Tobias Thanks, Roland! 
------------- PR: https://git.openjdk.org/jdk/pull/10932 From chagedorn at openjdk.org Mon Nov 14 08:29:54 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Nov 2022 08:29:54 GMT Subject: RFR: 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes [v3] In-Reply-To: References: Message-ID: <0HI-wa3aB37Rb3ezgqGi5fSFKS_cGVZGlBwr75af_As=.b007168c-d1e0-4cbe-8146-486f8692f7f3@github.com> > There are currently two problems with `IRNode.ALLOC*` regexes: > 1. On PPC64, we do not account for an `LI` instruction which matches the array size. As a result, we could miss some array allocations with the `ALLOC_ARRAY*` regexes: > > 2e4 LD R3, offset, R3 // load ptr precise [java/lang/Object: > 0x0000200058006e40 *: :Constant:exact * from TOC (lo) > 2e8 STD R17, [R1_SP + #104+0] // spill copy > 2ec LI R4, #1 <------- we only look for LGHI here which is specific to s390 while LI is used for PPC64 > 2f0 CALL,static 0x00002000177cd300 // ==> wrapper for: _new_array_Java > > This was revealed by a new test added by [JDK-8280378](https://bugs.openjdk.org/browse/JDK-8280378) but was already a problem before this change. > > 2. The newly added `IRNode.ALLOC*` regexes in JDK-8280378 which can be matched on the independent ideal compile phases by using the name of the IR node "Allocate" also matches "AllocateArray" (substring match). This is unexpected. I've changed this by matching "Allocate" exactly. > > I've additionally removed the matching of `LI` and `LGHI` for the `ALLOC` regexes on normal objects as we do not have an array size. I think it's safe to remove these (might need some additional testing on PPC64/s390). > > Thanks @TheRealMDoerr for helping to test the initial fix on PPC64! > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: remove whitespaces ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11037/files - new: https://git.openjdk.org/jdk/pull/11037/files/0d28220f..24713be2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11037&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11037&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11037.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11037/head:pull/11037 PR: https://git.openjdk.org/jdk/pull/11037 From chagedorn at openjdk.org Mon Nov 14 08:33:07 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Nov 2022 08:33:07 GMT Subject: Integrated: 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes In-Reply-To: References: Message-ID: <6V1Ks18_NFqpZAA5WMfPhqyNs3wVHQ2MEp_qNFzQa-M=.b5a9525e-3cca-4d80-9974-6ea6a3285449@github.com> On Tue, 8 Nov 2022 09:58:42 GMT, Christian Hagedorn wrote: > There are currently two problems with `IRNode.ALLOC*` regexes: > 1. On PPC64, we do not account for an `LI` instruction which matches the array size. As a result, we could miss some array allocations with the `ALLOC_ARRAY*` regexes: > > 2e4 LD R3, offset, R3 // load ptr precise [java/lang/Object: > 0x0000200058006e40 *: :Constant:exact * from TOC (lo) > 2e8 STD R17, [R1_SP + #104+0] // spill copy > 2ec LI R4, #1 <------- we only look for LGHI here which is specific to s390 while LI is used for PPC64 > 2f0 CALL,static 0x00002000177cd300 // ==> wrapper for: _new_array_Java > > This was revealed by a new test added by [JDK-8280378](https://bugs.openjdk.org/browse/JDK-8280378) but was already a problem before this change. > > 2. 
The newly added `IRNode.ALLOC*` regexes in JDK-8280378 which can be matched on the independent ideal compile phases by using the name of the IR node "Allocate" also matches "AllocateArray" (substring match). This is unexpected. I've changed this by matching "Allocate" exactly. > > I've additionally removed the matching of `LI` and `LGHI` for the `ALLOC` regexes on normal objects as we do not have an array size. I think it's safe to remove these (might need some additional testing on PPC64/s390). > > Thanks @TheRealMDoerr for helping to test the initial fix on PPC64! > > Thanks, > Christian This pull request has now been integrated. Changeset: 34d10f19 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/34d10f19f5321961bdeea8d1c9aff7ca89101d1f Stats: 23 lines in 2 files changed: 13 ins; 2 del; 8 mod 8296243: [IR Framework] Fix issues with IRNode.ALLOC* regexes Reviewed-by: mdoerr, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11037 From thartmann at openjdk.org Mon Nov 14 08:40:13 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Nov 2022 08:40:13 GMT Subject: RFR: 8296821: compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java fails after JDK-8262901 In-Reply-To: References: Message-ID: <6rVrQMjRdXvMhKiPudEFO0KWg2E0tX7iPwycQXDRR18=.09d33dde-e638-4f17-b90f-caf72782bd45@github.com> On Fri, 11 Nov 2022 22:34:59 GMT, Olga Mikhaltsova wrote: > This is a fix for a bug in the test that was added by [JDK-8262901](https://bugs.openjdk.org/browse/JDK-8262901). > The root cause of the issue was because the test returns 'int' while 'float' was expected. > > In addition the stack growth considering 16 bytes alignment in AMD64TestAssembler is fixed (similar to AArch64TestAssembler). > > Tested on macOS x64 / AArch64, Linux x64 / AArch64 as follow: > ` $JTREG_HOME/bin/jtreg -ea -jdk:$BUILD_HOME -nativepath:$NATIVE_PATH ./test/hotspot/jtreg/compiler/jvmci/` Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11114 From tholenstein at openjdk.org Mon Nov 14 08:42:32 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 14 Nov 2022 08:42:32 GMT Subject: RFR: JDK-8296665: IGV: Show dialog with stack trace for exceptions In-Reply-To: References: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> Message-ID: <9KkeYTA6sq0nLYA0AHPL6itlosf3tkLHzw2UOu2d8mQ=.4de4bed1-8b6a-429c-8c47-2491bcff7be4@github.com> On Fri, 11 Nov 2022 15:14:54 GMT, Tobias Hartmann wrote: >> Currently in IGV when an exception occurs a small red icon in the bottom right corner appears. The user often does not see this and if he sees it, usually not immediately when the error occurs: >> exception now >> >> The exception reporting level is changed to `1000` (Level.SEVERE) in IGV to show a dialog with the stack-trace. The user can still close it and continue the work: >> exception suggestion >> >> To test insert something like the following somewhere in the codebase >> >> try { >> int i=1/0; >> } catch (Exception e) { >> throw new RuntimeException(e); >> } > > Okay, thanks for the details. It's probably best to leave the patch as is then. thanks @TobiHartmann and @chhagedorn for the reviews! 
------------- PR: https://git.openjdk.org/jdk/pull/11060 From omikhaltcova at openjdk.org Mon Nov 14 08:43:34 2022 From: omikhaltcova at openjdk.org (Olga Mikhaltsova) Date: Mon, 14 Nov 2022 08:43:34 GMT Subject: Integrated: 8296821: compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java fails after JDK-8262901 In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 22:34:59 GMT, Olga Mikhaltsova wrote: > This is a fix for a bug in the test that was added by [JDK-8262901](https://bugs.openjdk.org/browse/JDK-8262901). > The root cause of the issue was because the test returns 'int' while 'float' was expected. > > In addition the stack growth considering 16 bytes alignment in AMD64TestAssembler is fixed (similar to AArch64TestAssembler). > > Tested on macOS x64 / AArch64, Linux x64 / AArch64 as follow: > ` $JTREG_HOME/bin/jtreg -ea -jdk:$BUILD_HOME -nativepath:$NATIVE_PATH ./test/hotspot/jtreg/compiler/jvmci/` This pull request has now been integrated. Changeset: 277f0c24 Author: Olga Mikhaltsova Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/277f0c24a2e186166bfe70fc93ba79aec10585aa Stats: 12 lines in 3 files changed: 6 ins; 3 del; 3 mod 8296821: compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/NativeCallTest.java fails after JDK-8262901 Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11114 From tholenstein at openjdk.org Mon Nov 14 08:44:16 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 14 Nov 2022 08:44:16 GMT Subject: Integrated: JDK-8296665: IGV: Show dialog with stack trace for exceptions In-Reply-To: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> References: <2dCzlObmwCSPAUcF0onZea7bmP6UlBhblebQ391VUU0=.7fad4bcd-f21e-4be6-a27a-85eeeca2934c@github.com> Message-ID: On Wed, 9 Nov 2022 13:10:54 GMT, Tobias Holenstein wrote: > Currently in IGV when an exception occurs a small red icon in the bottom right corner appears. The user often does not see this and if he sees it, usually not immediately when the error occurs: > exception now > > The exception reporting level is changed to `1000` (Level.SEVERE) in IGV to show a dialog with the stack-trace. The user can still close it and continue the work: > exception suggestion > > To test insert something like the following somewhere in the codebase > > try { > int i=1/0; > } catch (Exception e) { > throw new RuntimeException(e); > } This pull request has now been integrated. Changeset: 68301cde Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/68301cdecae861ecb6c910aeb89465a787184454 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8296665: IGV: Show dialog with stack trace for exceptions Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/11060 From dsamersoff at openjdk.org Mon Nov 14 09:17:31 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Mon, 14 Nov 2022 09:17:31 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 [v2] In-Reply-To: References: Message-ID: > In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. > > The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. 
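A minimal standalone sketch of the idea described above (toy code with made-up byte values, not the actual HotSpot patch; the real change publishes the word with an atomic 64-bit store):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // Pretend first 8 bytes of a method's verified entry, 8-byte aligned.
  alignas(8) unsigned char entry[8] = {0x55, 0x48, 0x89, 0xe5, 0x90, 0x90, 0x90, 0x90};

  // Build the replacement 8 bytes in one place: a 5-byte "jmp rel32" plus the
  // 3 original trailing bytes, so a single store does not clobber them.
  union {
    uint64_t cb_long;
    unsigned char code_buffer[8];
  } u;
  std::memcpy(u.code_buffer, entry, sizeof(entry));
  u.code_buffer[0] = 0xe9;                    // jmp rel32 opcode
  int32_t disp = 0x12345678;                  // made-up displacement to the new dest
  std::memcpy(&u.code_buffer[1], &disp, sizeof(disp));

  // One aligned 64-bit write publishes the whole instruction, instead of the
  // old 4-byte / 1-byte / 4-byte sequence where another core could observe a
  // half-patched entry.
  *reinterpret_cast<volatile uint64_t*>(entry) = u.cb_long;

  std::printf("first byte is now 0x%02x\n", entry[0]);  // prints e9
  return 0;
}
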
Dmitry Samersoff has updated the pull request incrementally with one additional commit since the last revision: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11059/files - new: https://git.openjdk.org/jdk/pull/11059/files/198a1e85..bbacb7f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11059&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11059&range=00-01 Stats: 13 lines in 1 file changed: 3 ins; 4 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/11059.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11059/head:pull/11059 PR: https://git.openjdk.org/jdk/pull/11059 From dsamersoff at openjdk.org Mon Nov 14 09:24:07 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Mon, 14 Nov 2022 09:24:07 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 [v2] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 22:39:32 GMT, Dean Long wrote: >> Dmitry Samersoff has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 > > src/hotspot/cpu/x86/nativeInst_x86.cpp line 511: > >> 509: // In JVMCI, the restriction is enforced by HotSpotFrameContext.enter(...) >> 510: // >> 511: void NativeJump::patch_verified_entry(address entry, address verified_entry, address dest) { > > Should we assert that the appropriate lock is held here? This code can be called from different places and therefore can require different locks, so I wold prefer not to put any caller-specific logic to this function. Also, original 32bit code doesn't have such assert. ------------- PR: https://git.openjdk.org/jdk/pull/11059 From dsamersoff at openjdk.org Mon Nov 14 09:24:10 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Mon, 14 Nov 2022 09:24:10 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 [v2] In-Reply-To: References: Message-ID: <78aKIpKhgNw_9GCZ-U1kzSwgeMJmPFtu60ZmRyFZMbY=.c6570ba1-5153-4078-b027-75b1db981a66@github.com> On Sat, 12 Nov 2022 00:40:01 GMT, Vladimir Ivanov wrote: >> Dmitry Samersoff has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 > > src/hotspot/cpu/x86/nativeInst_x86.cpp line 513: > >> 511: void NativeJump::patch_verified_entry(address entry, address verified_entry, address dest) { >> 512: // complete jump instruction (to be inserted) is in code_buffer; >> 513: #ifdef AMD64 > > Just a minor suggestion: `_LP64` is more appropriate here since it's x86-specific file. done ------------- PR: https://git.openjdk.org/jdk/pull/11059 From dsamersoff at openjdk.org Mon Nov 14 09:24:11 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Mon, 14 Nov 2022 09:24:11 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 [v2] In-Reply-To: References: <8aQHbzt0g0EQyvuEvycMD8_puZ8Z_Zs65LKiV-jcWUE=.3cd9286f-2177-4506-81e7-28a98beacc46@github.com> Message-ID: On Fri, 11 Nov 2022 23:48:22 GMT, Vladimir Kozlov wrote: >>> ``` >>> union { >>> jlong cb_long; >>> unsigned char code_buffer[8]; >>> } u; >>> ``` >>> >> >> This version is fine. It removes any question about the alignment of cb_long at the same time. > >> This version is fine. It removes any question about the alignment of cb_long at the same time. > > I agree. 
done ------------- PR: https://git.openjdk.org/jdk/pull/11059 From aph at openjdk.org Mon Nov 14 09:24:33 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 14 Nov 2022 09:24:33 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 03:06:21 GMT, Dong Bo wrote: > In JDK-8252204, when implemented SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`. > However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`: > > digest_length block_size > SHA3-224 28 144 > SHA3-256 32 136 > SHA3-384 48 104 > SHA3-512 64 72 > SHAKE128 variable 168 > SHAKE256 variable 136 > > > This causes SIGSEGV crash or hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`. > > The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances. > Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed. > And tier1~3 passed on SHA3 supported hardware. > > The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform. > The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively. > > Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform. > Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression. > These performance details are available in the comments of the issue page. > I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform. This looks right, but I don't think I can test it, which I usually would do with a patch this complicated. When we have a processor without FEAT_SHA3) we should define BCAX, EOR3, RAX1, and XAR as macros. Could you do that, please? ------------- PR: https://git.openjdk.org/jdk/pull/10939 From rcastanedalo at openjdk.org Mon Nov 14 09:25:29 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 14 Nov 2022 09:25:29 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong [v3] In-Reply-To: References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: On Fri, 11 Nov 2022 17:33:02 GMT, Christian Hagedorn wrote: >> We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): >> >> assert(real_LCA != NULL, "must always find an LCA" >> ``` >> The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: >> - I limited the node dump to idx + node name to reduce the noise which made it hard to read. >> - Reversed the idom chain dumps to reflect the graph structure. >> >> Example output: >> >> Bad graph detected in build_loop_late >> n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) >> >> [... same output as before ...] 
>> >> idoms of early "197 IfFalse": >> idom[2]: 42 If >> idom[1]: 44 IfTrue >> idom[0]: 196 If >> n: 197 IfFalse >> >> idoms of (wrong) LCA "205 IfTrue": >> idom[4]: 42 If >> idom[3]: 37 Region >> idom[2]: 73 If >> idom[1]: 83 IfTrue >> idom[0]: 204 If >> n: 205 IfTrue >> >> Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): >> 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) >> >> Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Change algorithm as suggested by Roberto Thanks for taking the suggestion into account, looks good! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11015 From bkilambi at openjdk.org Mon Nov 14 09:37:53 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Nov 2022 09:37:53 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v4] In-Reply-To: References: Message-ID: > Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - > > eor a, a, b > eor a, a, c > > can be optimized to single instruction - `eor3 a, b, c` > > This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - > > > Benchmark gain > TestEor3.test1Int 10.87% > TestEor3.test1Long 8.84% > TestEor3.test2Int 21.68% > TestEor3.test2Long 21.04% > > > The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Removed svesha3 feature check for eor3 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10407/files - new: https://git.openjdk.org/jdk/pull/10407/files/449524ad..7f413360 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10407&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10407&range=02-03 Stats: 16 lines in 6 files changed: 0 ins; 9 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/10407.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10407/head:pull/10407 PR: https://git.openjdk.org/jdk/pull/10407 From bkilambi at openjdk.org Mon Nov 14 09:37:54 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Nov 2022 09:37:54 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v3] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 14:27:34 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. 
This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Changed the modifier order preference in JTREG test The new patch removes the svesha3 feature check for eor3 instruction. Eor3 instruction is part of the SHA3 feature but it is present by default in SVE2 and is not part of the SVESHA3 feature. Please review. Thank you .. ------------- PR: https://git.openjdk.org/jdk/pull/10407 From rcastanedalo at openjdk.org Mon Nov 14 09:56:25 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 14 Nov 2022 09:56:25 GMT Subject: RFR: JDK-8295934: IGV: keep node selection when changing view or graph In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 14:30:48 GMT, Tobias Holenstein wrote: > In IGV nodes can be selected by clicking on it. When a user selects nodes in a certain view, e.g. "Cluster nodes into blocks" view, and then change to e.g. "Sea of nodes" view, the selection should be kept. Same when the user goes a different graph in the same group, the selection should be kept (as long as the nodes are still present) > > New selection features: > - When opening a new graph and no nodes where selected previously the root nodes is selected and centered. > selected_root > > - When a graph in the same group is opened, the previously selected nodes as selected as well if they are present in the graph. The selected nodes are centered in the new graph. > - The selected nodes are kept when changing the view, or the properties of the view (e.g. "show neighboring nodes semi-transparent") > cluster_view > desired > > - When "show neighboring nodes semi-transparent" is disabled, previously semi-transparent nodes that were selected are now unselected (because they are not visible anymore) > - It would also be desired adjust the scroll pane to center the selected nodes when changing view, graph, etc. Keeping the user's node selection across views and related graphs is a useful enhancement, thanks! As a user, I only have an objection to selecting the Root node when opening a new graph without a previous selection: I would rather not select any node in that situation. Selecting the Root node forces the attention of the user to an arbitrary part of the graph, and this node is not relevant to all views, e.g. the CFG view of the current "Final Code" graph. 
Regarding the code changes, I think it would be better for ease of reviewing and traceability to leave cleanups and unrelated refactorings (such as removal of unused imports in `GraphViewerImplementation.java` or not passing a `SceneAnimator` to `LineWidget.java`) to separate RFEs. ------------- Changes requested by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11062 From bulasevich at openjdk.org Mon Nov 14 11:26:47 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Mon, 14 Nov 2022 11:26:47 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v8] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: add test for buffer grow ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/637c94be..99522fd0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=06-07 Stats: 25 lines in 2 files changed: 22 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From dongbo at openjdk.org Mon Nov 14 11:50:27 2022 From: dongbo at openjdk.org (Dong Bo) Date: Mon, 14 Nov 2022 11:50:27 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: References: Message-ID: <2_tbWMZjjQIyN_HVI_KUY9znox8zGh1vSEtj56vvZog=.0b9c6be0-cae9-4f06-99d9-6c4058e273c3@github.com> On Mon, 14 Nov 2022 09:20:32 GMT, Andrew Haley wrote: > This looks right, but I don't think I can test it, which I usually would do with a patch this complicated. When we have a processor without FEAT_SHA3) we should define BCAX, EOR3, RAX1, and XAR as macros. Could you do that, please? Thanks for the comments. Do you mean that we need a patch (not to be merged) to support testing on processor without FEAT_SHA3? In which, the SHA3 instructions are substituted by multiple instructions, something like: `eor3 v1, v2, v3, v4 => eor v1, v2, v3; eor v1, v1, v4`? BTW, FEAT_SHA3 is supported on M1. If you happen to have one, the test can be done on it. :) To test this on M1/MacOS, modification below is needed to enable SHA3Intriniscs by default. Since other features, i.e. UseSHA, can not be automatically detected neither, I think it is irrelevant with this patch. 
--- a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp +++ b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp @@ -334,15 +334,15 @@ void VM_Version::initialize() { FLAG_SET_DEFAULT(UseSHA256Intrinsics, false); } - if (UseSHA && VM_Version::supports_sha3()) { + // if (UseSHA && VM_Version::supports_sha3()) { // Do not auto-enable UseSHA3Intrinsics until it has been fully tested on hardware - // if (FLAG_IS_DEFAULT(UseSHA3Intrinsics)) { - // FLAG_SET_DEFAULT(UseSHA3Intrinsics, true); - // } - } else if (UseSHA3Intrinsics) { - warning("Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU."); - FLAG_SET_DEFAULT(UseSHA3Intrinsics, false); - } + if (FLAG_IS_DEFAULT(UseSHA3Intrinsics)) { + FLAG_SET_DEFAULT(UseSHA3Intrinsics, true); + } + //} else if (UseSHA3Intrinsics) { + // warning("Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU."); + // FLAG_SET_DEFAULT(UseSHA3Intrinsics, false); + //} ------------- PR: https://git.openjdk.org/jdk/pull/10939 From roland at openjdk.org Mon Nov 14 15:07:18 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 14 Nov 2022 15:07:18 GMT Subject: RFR: 8296805: ctw build is broken In-Reply-To: <4YkL1GdhkoqSKUN7Levb8ScevrPJFvts0gIgp4RsuEA=.7e9b8fb6-0528-47fc-99da-0691ab7fe01f@github.com> References: <4YkL1GdhkoqSKUN7Levb8ScevrPJFvts0gIgp4RsuEA=.7e9b8fb6-0528-47fc-99da-0691ab7fe01f@github.com> Message-ID: On Thu, 10 Nov 2022 19:34:46 GMT, Vladimir Kozlov wrote: >> I noticed the build for the ctw tool based on the WhiteBox API is >> broken. This fixes it AFAICT. > > I verified that our testing does not use this Makefile. > I run `make` locally as you did and it passed with this fix. > I consider it is trivial. Thanks @vnkozlov @TobiHartmann for the reviews ------------- PR: https://git.openjdk.org/jdk/pull/11090 From roland at openjdk.org Mon Nov 14 15:09:03 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 14 Nov 2022 15:09:03 GMT Subject: Integrated: 8296805: ctw build is broken In-Reply-To: References: Message-ID: <-oPwJCGaXTVCeH476lE-RtrRMm_wZLIQM1B-u2WeJTg=.a626abae-a3fc-4894-b308-4b41908e97e9@github.com> On Thu, 10 Nov 2022 16:54:39 GMT, Roland Westrelin wrote: > I noticed the build for the ctw tool based on the WhiteBox API is > broken. This fixes it AFAICT. This pull request has now been integrated. Changeset: 0fe2bf51 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/0fe2bf51b2f62bd95ef653fec4b97bea82e002e8 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8296805: ctw build is broken Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11090 From roland at openjdk.org Mon Nov 14 15:12:38 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 14 Nov 2022 15:12:38 GMT Subject: Integrated: 8294217: Assertion failure: parsing found no loops but there are some In-Reply-To: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> References: <8wgYaLn82fk_CgKacAQsEygK63k6KDDKYXf8m4cv_OM=.d04ae5d7-b3f8-47b3-bec8-aaaa4234f018@github.com> Message-ID: On Fri, 28 Oct 2022 14:34:42 GMT, Roland Westrelin wrote: > This was reported on 11 and is not reproducible with the current > jdk. The reason is that the PhaseIdealLoop invocation before EA was > changed from LoopOptsNone to LoopOptsMaxUnroll. In the absence of > loops, LoopOptsMaxUnroll exits earlier than LoopOptsNone. That wasn't > intended and this patch makes sure they behave the same. 
Once that's > changed, the crash reproduces with the current jdk. > > The assert fires because PhaseIdealLoop::only_has_infinite_loops() > returns false even though the IR only has infinite loops. There's a > single loop nest and the inner most loop is an infinite loop. The > current logic only looks at loops that are direct children of the root > of the loop tree. It's not the first bug where > PhaseIdealLoop::only_has_infinite_loops() fails to catch an infinite > loop (8257574 was the previous one) and it's proving challenging to > have PhaseIdealLoop::only_has_infinite_loops() handle corner cases > robustly. I reworked PhaseIdealLoop::only_has_infinite_loops() once > more. This time it goes over all children of the root of the loop > tree, collects all controls for the loop and its inner loop. It then > checks whether any control is a branch out of the loop and if it is > whether it's not a NeverBranch. This pull request has now been integrated. Changeset: 8c472e48 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/8c472e481676ed0ef475c4989477d5714880c59e Stats: 113 lines in 2 files changed: 98 ins; 3 del; 12 mod 8294217: Assertion failure: parsing found no loops but there are some Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10904 From tholenstein at openjdk.org Mon Nov 14 15:41:37 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 14 Nov 2022 15:41:37 GMT Subject: RFR: JDK-8295934: IGV: keep node selection when changing view or graph [v2] In-Reply-To: References: Message-ID: > In IGV nodes can be selected by clicking on it. When a user selects nodes in a certain view, e.g. "Cluster nodes into blocks" view, and then change to e.g. "Sea of nodes" view, the selection should be kept. Same when the user goes a different graph in the same group, the selection should be kept (as long as the nodes are still present) > > New selection features: > - When opening a new graph and no nodes where selected previously the root nodes is selected and centered. > selected_root > > - When a graph in the same group is opened, the previously selected nodes as selected as well if they are present in the graph. The selected nodes are centered in the new graph. > - The selected nodes are kept when changing the view, or the properties of the view (e.g. "show neighboring nodes semi-transparent") > cluster_view > desired > > - When "show neighboring nodes semi-transparent" is disabled, previously semi-transparent nodes that were selected are now unselected (because they are not visible anymore) > - It would also be desired adjust the scroll pane to center the selected nodes when changing view, graph, etc. 
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: center root but do not select ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11062/files - new: https://git.openjdk.org/jdk/pull/11062/files/c365e70a..f2521c3b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11062&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11062&range=00-01 Stats: 42 lines in 3 files changed: 25 ins; 14 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11062.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11062/head:pull/11062 PR: https://git.openjdk.org/jdk/pull/11062 From luhenry at openjdk.org Mon Nov 14 15:49:31 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Mon, 14 Nov 2022 15:49:31 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 07:57:30 GMT, Yi-Fan Tsai wrote: > The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. > > Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. > > No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. > > Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. Could you please post JMH microbenchmarks with and without this change? You can run them with `org.openjdk.bench.java.security.MessageDigests` [1] [1] https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java ------------- PR: https://git.openjdk.org/jdk/pull/11054 From tholenstein at openjdk.org Mon Nov 14 15:57:41 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 14 Nov 2022 15:57:41 GMT Subject: RFR: JDK-8295934: IGV: keep node selection when changing view or graph [v3] In-Reply-To: References: Message-ID: > In IGV nodes can be selected by clicking on it. When a user selects nodes in a certain view, e.g. "Cluster nodes into blocks" view, and then change to e.g. "Sea of nodes" view, the selection should be kept. Same when the user goes a different graph in the same group, the selection should be kept (as long as the nodes are still present) > > New selection features: > - When opening a new graph and no nodes where selected previously the root nodes is selected and centered. > selected_root > > - When a graph in the same group is opened, the previously selected nodes as selected as well if they are present in the graph. The selected nodes are centered in the new graph. > - The selected nodes are kept when changing the view, or the properties of the view (e.g. "show neighboring nodes semi-transparent") > cluster_view > desired > > - When "show neighboring nodes semi-transparent" is disabled, previously semi-transparent nodes that were selected are now unselected (because they are not visible anymore) > - It would also be desired adjust the scroll pane to center the selected nodes when changing view, graph, etc. 
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: added missing validate ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11062/files - new: https://git.openjdk.org/jdk/pull/11062/files/f2521c3b..b74423b0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11062&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11062&range=01-02 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11062.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11062/head:pull/11062 PR: https://git.openjdk.org/jdk/pull/11062 From bulasevich at openjdk.org Mon Nov 14 16:01:01 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Mon, 14 Nov 2022 16:01:01 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v9] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: warning fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/99522fd0..e5f03dda Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=07-08 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From tholenstein at openjdk.org Mon Nov 14 16:11:45 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 14 Nov 2022 16:11:45 GMT Subject: RFR: JDK-8295934: IGV: keep node selection when changing view or graph [v4] In-Reply-To: References: Message-ID: > In IGV nodes can be selected by clicking on it. When a user selects nodes in a certain view, e.g. "Cluster nodes into blocks" view, and then change to e.g. "Sea of nodes" view, the selection should be kept. Same when the user goes a different graph in the same group, the selection should be kept (as long as the nodes are still present) > > New selection features: > - When opening a new graph and no nodes where selected previously the root nodes is selected and centered. > selected_root > > - When a graph in the same group is opened, the previously selected nodes as selected as well if they are present in the graph. The selected nodes are centered in the new graph. > - The selected nodes are kept when changing the view, or the properties of the view (e.g. "show neighboring nodes semi-transparent") > cluster_view > desired > > - When "show neighboring nodes semi-transparent" is disabled, previously semi-transparent nodes that were selected are now unselected (because they are not visible anymore) > - It would also be desired adjust the scroll pane to center the selected nodes when changing view, graph, etc. 
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: revert unrelated changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11062/files - new: https://git.openjdk.org/jdk/pull/11062/files/b74423b0..8e8033da Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11062&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11062&range=02-03 Stats: 9 lines in 4 files changed: 8 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11062.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11062/head:pull/11062 PR: https://git.openjdk.org/jdk/pull/11062 From tholenstein at openjdk.org Mon Nov 14 16:26:35 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 14 Nov 2022 16:26:35 GMT Subject: RFR: JDK-8295934: IGV: keep node selection when changing view or graph [v4] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 09:54:00 GMT, Roberto Casta?eda Lozano wrote: > Keeping the user's node selection across views and related graphs is a useful enhancement, thanks! > > As a user, I only have an objection to selecting the Root node when opening a new graph without a previous selection: I would rather not select any node in that situation. Selecting the Root node forces the attention of the user to an arbitrary part of the graph, and this node is not relevant to all views, e.g. the CFG view of the current "Final Code" graph. > > Regarding the code changes, I think it would be better for ease of reviewing and traceability to leave cleanups and unrelated refactorings (such as removal of unused imports in `GraphViewerImplementation.java` or not passing a `SceneAnimator` to `LineWidget.java`) to separate RFEs. Hi @robcasloz Thanks for your comment! I agree that the root not should not be selected. I changed the PR accordingly: now the root node is only centered but not selected anymore. I also agree that import of untouched files should not be touches. I reverted that. The `SceneAnimator` in `LineWidget.java` was removed because in order to implement this PR I had to refactor the `processOutputSlot()` method which is the only place there `LineWidget.java` is initiated. ------------- PR: https://git.openjdk.org/jdk/pull/11062 From dsamersoff at openjdk.org Mon Nov 14 16:44:27 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Mon, 14 Nov 2022 16:44:27 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 [v3] In-Reply-To: References: Message-ID: <2_USY981vz6kpzHryQDRTV02yC2ylucq74lW4GYL9E4=.9bdd7ade-2837-423e-9125-bffb41df6fc3@github.com> > In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. > > The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. 
Dmitry Samersoff has updated the pull request incrementally with one additional commit since the last revision: 8294947: Use 64bit atomics in patch_verified_entry on x86_64 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11059/files - new: https://git.openjdk.org/jdk/pull/11059/files/bbacb7f4..82c3dbaf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11059&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11059&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11059.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11059/head:pull/11059 PR: https://git.openjdk.org/jdk/pull/11059 From kvn at openjdk.org Mon Nov 14 17:02:27 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Nov 2022 17:02:27 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 [v3] In-Reply-To: <2_USY981vz6kpzHryQDRTV02yC2ylucq74lW4GYL9E4=.9bdd7ade-2837-423e-9125-bffb41df6fc3@github.com> References: <2_USY981vz6kpzHryQDRTV02yC2ylucq74lW4GYL9E4=.9bdd7ade-2837-423e-9125-bffb41df6fc3@github.com> Message-ID: <7bU8kajQ4Wxpemm9HTm7r4ueUaZvzpkWgFmnnysh_xs=.6f2edabf-d115-4ddc-9c2a-26ab4c377db5@github.com> On Mon, 14 Nov 2022 16:44:27 GMT, Dmitry Samersoff wrote: >> In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. >> >> The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. > > Dmitry Samersoff has updated the pull request incrementally with one additional commit since the last revision: > > 8294947: Use 64bit atomics in patch_verified_entry on x86_64 Looks good. I submitted testing. ------------- PR: https://git.openjdk.org/jdk/pull/11059 From eastigeevich at openjdk.org Mon Nov 14 17:33:20 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 14 Nov 2022 17:33:20 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v9] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 16:01:01 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > warning fix Changes requested by eastigeevich (Committer). src/hotspot/share/code/compressedStream.cpp line 192: > 190: if (_position >= _size) { > 191: grow(); > 192: } Now we have these checks spread across the code. There are two actions changing `_postion`: - `_position++` - `set_position` We can replace `_position++` with `inc_position` where we can have the check with `grow`. Regarding `set_position `, I have looked at its current uses. Its uses are to support shared debug info: - We write info. - We check if we have written the same info. - If yes, we use the one written before and roll back position. If I haven't missed other uses, the meaning of `set_position` is to roll back. In such case, no `grow` is needed. I suggest to rename `set_position` to `roll_back_to` or `move_back_to`. 
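A compile-ready toy sketch of both suggestions (ToyWriteStream, inc_position and roll_back_to are placeholder names used only for illustration here, not existing HotSpot code):

#include <cassert>
#include <climits>
#include <cstring>

class ToyWriteStream {
  unsigned char* _buffer;
  int _size;
  int _position;

  void grow() {
    assert(_size <= INT_MAX / 2);      // doubling must not overflow int
    int nsize = _size * 2;
    unsigned char* nbuf = new unsigned char[nsize]();
    std::memcpy(nbuf, _buffer, _size);
    delete[] _buffer;
    _buffer = nbuf;
    _size = nsize;
  }

public:
  explicit ToyWriteStream(int size)
    : _buffer(new unsigned char[size]()), _size(size), _position(0) {}
  ~ToyWriteStream() { delete[] _buffer; }

  void inc_position() {                // the one place that checks for growth
    _position++;
    if (_position >= _size) grow();
  }

  void roll_back_to(int pos) {         // replaces set_position: backwards only
    assert(pos <= _position);
    _position = pos;                   // never moves forward, so no grow check
  }

  int position() const { return _position; }
};
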
src/hotspot/share/code/compressedStream.hpp line 184: > 182: } > 183: > 184: void flush() { Why do we need `flush` if we modify data in place? ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Mon Nov 14 17:33:31 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 14 Nov 2022 17:33:31 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v7] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 12:33:17 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision: > > - adding jtreg test for CompressedSparseDataReadStream impl > - align java impl to cpp impl > - rewrite the SparseDataWriteStream not to use _curr_byte > - introduce and call flush() excplicitly, add the gtest > - minor renaming. adding encoding examples table > - cleanup and rename > - cleanup > - rewrite code without virtual functions > - warning fix and name fix > - optimize the encoding > - ... and 2 more: https://git.openjdk.org/jdk/compare/f8e33862...637c94be src/hotspot/share/code/compressedStream.cpp line 219: > 217: > 218: void CompressedSparseDataWriteStream::grow() { > 219: int nsize = _size * 2; Signed integer overflow is UB. The correct assert: assert(_size <= INT_MAX / 2, "debug data size must not exceed INT_MAX"); int nsize = _size * 2; ------------- PR: https://git.openjdk.org/jdk/pull/10025 From duke at openjdk.org Mon Nov 14 17:50:47 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Mon, 14 Nov 2022 17:50:47 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 20:46:57 GMT, Volodymyr Paprotski wrote: >> It's not specific to `andq`: there's a huge `#ifdef` block around the definitions in `assembler_x86.hpp` (lines 12201 - 13773; and there's even a nested `#ifdef _LP64` (lines 13515-13585)!) , but declarations aren't guarded by `#ifdef _LP64`. > > Yeah, just got to about the same conclusion by looking at the preprocessor `-E` output.. its declared in the header, but not defined in the 'cpp' file.. One would think that that's a compile error, but its been more then a decade since I looked at the C++ spec; 'C++ compiler is always right'. Don't know that there is anything else for me to do here? `assembler_x86.hpp` `#ifdef _LP64` macros were there before (and it not 'that wrong' or if a better/clean fix exists). `macroAssembler_x86.hpp` has to mirror that with `andq`. (Just going through all the comments, making sure they have been addressed.) PS: In general I get worried about having macros changing object layout, but that's 'water under the bridge' and 64-bit seems big enough reason to have different layout. But its always 'entertaining debugging session' when offset of `a.f` is different in `a.o` and `b.o`, because somebody forgot to define same macros for `b.c` compile command as for `a.c`.. 
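A tiny illustration of that failure mode (toy code, unrelated to the JDK sources): compile one translation unit with -DFOO and another without, and the two object files disagree about the layout of A and the offset of A::f, an ODR violation the linker will not flag.

#include <cstddef>
#include <cstdio>

struct A {
#ifdef FOO
  long only_present_with_FOO;   // hypothetical macro-guarded field
#endif
  int f;
};

int main() {
  // 0 without -DFOO, typically 8 with it on an LP64 target.
  std::printf("offsetof(A, f) = %zu\n", offsetof(A, f));
  return 0;
}
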
------------- PR: https://git.openjdk.org/jdk/pull/10582 From aph at openjdk.org Mon Nov 14 17:57:31 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 14 Nov 2022 17:57:31 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: <2_tbWMZjjQIyN_HVI_KUY9znox8zGh1vSEtj56vvZog=.0b9c6be0-cae9-4f06-99d9-6c4058e273c3@github.com> References: <2_tbWMZjjQIyN_HVI_KUY9znox8zGh1vSEtj56vvZog=.0b9c6be0-cae9-4f06-99d9-6c4058e273c3@github.com> Message-ID: On Mon, 14 Nov 2022 11:46:57 GMT, Dong Bo wrote: > > This looks right, but I don't think I can test it, which I usually would do with a patch this complicated. When we have a processor without FEAT_SHA3) we should define BCAX, EOR3, RAX1, and XAR as macros. Could you do that, please? > > Thanks for the comments. > > Do you mean that we need a patch (not to be merged) to support testing on processor without FEAT_SHA3? In which, the SHA3 instructions are substituted by multiple instructions, something like: `eor3 v1, v2, v3, v4 => eor v1, v2, v3; eor v1, v1, v4`? Yes, exactly. So the intrinsic will work everywhere. Why not merge it? It seems obvious to me. > BTW, FEAT_SHA3 is supported on M1. If you happen to have one, the test can be done on it. :) To test this on M1/MacOS, modification below is needed to enable SHA3Intriniscs by default. Since other features, i.e. UseSHA, can not be automatically detected neither, I think it is irrelevant with this patch. Do you know why SHA3 isn't enabled on M1? ------------- PR: https://git.openjdk.org/jdk/pull/10939 From duke at openjdk.org Mon Nov 14 17:58:36 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Mon, 14 Nov 2022 17:58:36 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 23 commits: - Merge remote-tracking branch 'origin/master' into avx512-poly - Vladimir's review - live review with Sandhya - jcheck - Sandhya's review - fix windows and 32b linux builds - add getLimbs to interface and reviews - fix 32-bit build - make UsePolyIntrinsics option diagnostic - Merge remote-tracking branch 'origin/master' into avx512-poly - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db ------------- Changes: https://git.openjdk.org/jdk/pull/10582/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=15 Stats: 1851 lines in 32 files changed: 1815 ins; 3 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Mon Nov 14 17:58:37 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Mon, 14 Nov 2022 17:58:37 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v15] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 17:56:55 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > Vladimir's review Try to get clean build, pull in https://github.com/openjdk/jdk/pull/11065 ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Mon Nov 14 18:39:29 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 14 Nov 2022 18:39:29 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 17:48:25 GMT, Volodymyr Paprotski wrote: >> Yeah, just got to about the same conclusion by looking at the preprocessor `-E` output.. its declared in the header, but not defined in the 'cpp' file.. One would think that that's a compile error, but its been more then a decade since I looked at the C++ spec; 'C++ compiler is always right'. 
> > Don't know that there is anything else for me to do here? `assembler_x86.hpp` `#ifdef _LP64` macros were there before (and it not 'that wrong' or if a better/clean fix exists). `macroAssembler_x86.hpp` has to mirror that with `andq`. > > (Just going through all the comments, making sure they have been addressed.) > > PS: In general I get worried about having macros changing object layout, but that's 'water under the bridge' and 64-bit seems big enough reason to have different layout. But its always 'entertaining debugging session' when offset of `a.f` is different in `a.o` and `b.o`, because somebody forgot to define same macros for `b.c` compile command as for `a.c`.. Leave it as is. It'll be addressed separately. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From shade at openjdk.org Mon Nov 14 19:47:29 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Nov 2022 19:47:29 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v3] In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. > > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. > > We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. > > Motivational improvements on the test above: > > > Benchmark (size) Mode Cnt Score Error Units > > # Before, full Java blackholes > ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op > ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op > ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op > ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op > > # Before, compiler blackholes > ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op > ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op > ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op > ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op > > # After, compiler blackholes > ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better > ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better > ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better > ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better > > > `-prof perfasm` shows the reason for these improvements clearly: > > Before: > > > ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 > 7.97% ? 
0x00007f0b54498365: cmp %edx,%r11d > 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f > ? 0x00007f0b5449836a: shl $0x3,%r10 > 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" > 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" > 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 > 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ > 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 > 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check > 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d > 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 > > > After: > > > > ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read > 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 > ? 0x00007fa06c49a8dc: cmp %esi,%r10d > 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 > > > Additional testing: > - [x] Eyeballing JMH Samples `-prof perfasm` > - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` > - [x] Linux x86_64 fastdebug, JDK benchmark corpus Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Blackhole as CFG node - Merge branch 'master' into JDK-8296545-blackhole-effects - Blackhole should be AliasIdxTop - Do not touch memory at all - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11041/files - new: https://git.openjdk.org/jdk/pull/11041/files/1ca2febe..66247f75 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=01-02 Stats: 8138 lines in 394 files changed: 2756 ins; 3855 del; 1527 mod Patch: https://git.openjdk.org/jdk/pull/11041.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11041/head:pull/11041 PR: https://git.openjdk.org/jdk/pull/11041 From shade at openjdk.org Mon Nov 14 19:47:31 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Nov 2022 19:47:31 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v3] In-Reply-To: <4b-2Gfc6NLLmAysjDqBtvLAuxksKkmU5ZP5VDBb_2NQ=.548c9041-b4c9-469a-9c0e-58a9295d4dda@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> <4b-2Gfc6NLLmAysjDqBtvLAuxksKkmU5ZP5VDBb_2NQ=.548c9041-b4c9-469a-9c0e-58a9295d4dda@github.com> Message-ID: On Fri, 11 Nov 2022 01:02:35 GMT, Vladimir Ivanov wrote: >> Well, I don't see how I can make `Blackhole` produce only control. It seems the rest of C2 code frowns upon CFG nodes that are not control projections (or safepoints), see e.g. `is_control_proj_or_safepoint` asserts. Any hints how to proceed here? Maybe an example for such CFG node somewhere? >> >> Otherwise, I'd think keeping Blackhole a `MultiNode` and then take the control projection off it -- like in my `blackhole-cfg-1.patch` above -- is the way to do it. 
> > Yeah, I agree that your `blackhole-cfg-1.patch` is the lowest friction way to achieve the goal. > > There are some pure control nodes (e.g., `Region`), but they are treated specially during code motion. So, special cases for `Blackhole` would be needed there. All right, I pushed the version that does `MultiNode` and passes all my tests. ------------- PR: https://git.openjdk.org/jdk/pull/11041 From dnsimon at openjdk.org Mon Nov 14 20:29:05 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 14 Nov 2022 20:29:05 GMT Subject: RFR: 8296958: [JVMCI] add API for retrieving ConstantValue attributes Message-ID: In order to properly initialize classes in a native image at run time, Native Image needs to capture the value of [`ConstantValue` attributes](https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-4.html#jvms-4.7.2) at image build time. This PR adds `ResolvedJavaField.getConstantValue()` for this purpose. ------------- Commit messages: - added ResolvedJavaField.getConstantValue Changes: https://git.openjdk.org/jdk/pull/11144/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11144&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296958 Stats: 204 lines in 9 files changed: 198 ins; 5 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11144.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11144/head:pull/11144 PR: https://git.openjdk.org/jdk/pull/11144 From dnsimon at openjdk.org Mon Nov 14 20:38:21 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 14 Nov 2022 20:38:21 GMT Subject: RFR: 8296960: [JVMCI] list HotSpotConstantPool.loadReferencedType to ConstantPool Message-ID: `HotSpotConstantPool.loadReferencedType(int cpi, int opcode, boolean initialize)` allows loading a type without triggering class initialization. This PR lifts this method up to `ConstantPool` so that this functionality can be used without depending on HotSpot-specific JVMCI classes. ------------- Commit messages: - lift loadReferencedType with initialize parameter up to ConstantPool (GR-41975) Changes: https://git.openjdk.org/jdk/pull/11145/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11145&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296960 Stats: 19 lines in 2 files changed: 19 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11145.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11145/head:pull/11145 PR: https://git.openjdk.org/jdk/pull/11145 From dnsimon at openjdk.org Mon Nov 14 20:44:54 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 14 Nov 2022 20:44:54 GMT Subject: RFR: 8292961: [JVMCI] Access to j.l.r.Method/Constructor/Field for ResolvedJavaMethod/ResolvedJavaField Message-ID: Native Image needs to convert `ResolvedJavaMethod` objects to `java.lang.reflect.Executable` objects and `ResolvedJavaField` objects to `java.lang.reflect.Field` objects. This is currently done by digging into JVMCI internals with reflection. Instead, this functionality should be exposed by public JVMCI API which is what this PR does. 
------------- Commit messages: - add API to convert ResolvedJava[Method|Field] to Method|Constructor|Field (GR-41976) Changes: https://git.openjdk.org/jdk/pull/11146/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11146&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8292961 Stats: 37 lines in 2 files changed: 34 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11146.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11146/head:pull/11146 PR: https://git.openjdk.org/jdk/pull/11146 From dnsimon at openjdk.org Mon Nov 14 21:12:07 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 14 Nov 2022 21:12:07 GMT Subject: RFR: 8296967: JVMCI] rationalize relationship between getCodeSize and getCode in ResolvedJavaMethod Message-ID: When `ResolvedJavaMethod.getCodeSize()` returns a value > 0, `ResolvedJavaMethod.getCode()` will return `null` if the declaring class is not linked, contrary to the intuition of most JVMCI API users. This PR rationalizes the API such that: ResolvedJavaMethod m = ...; ResolvedJavaType c = m.getDeclaringClass(); assert (m.getCodeSize() > 0) == (m.getCode() != null); // m is a non-abstract, non-native method whose declaring class is linked in the current runtime assert (m.getCodeSize() == 0) == (m.getCode() == null); // m is an abstract or native method assert c.isLinked() == (m.getCodeSize() >= 0); // m's code size will always be >= 0 if its declaring class is linked in the current runtime ------------- Commit messages: - getCodeSize and hasBytecodes in ResolvedJavaMethod now return -1 and false for concrete methods in unlinked classes (GR-41977) Changes: https://git.openjdk.org/jdk/pull/11147/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11147&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296967 Stats: 317 lines in 8 files changed: 232 ins; 38 del; 47 mod Patch: https://git.openjdk.org/jdk/pull/11147.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11147/head:pull/11147 PR: https://git.openjdk.org/jdk/pull/11147 From kvn at openjdk.org Mon Nov 14 21:21:00 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Nov 2022 21:21:00 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v3] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: <8oFMPgU_6YZxluUlWdVW2EApTVylfVuk2hyzFXPgBY0=.3919a854-56d6-4183-87e8-264d7c93d885@github.com> On Mon, 14 Nov 2022 19:47:29 GMT, Aleksey Shipilev wrote: >> If you look at generated code for the JMH benchmark like: >> >> >> public class ArrayRead { >> @Param({"1", "100", "10000", "1000000"}) >> int size; >> >> int[] is; >> >> @Setup >> public void setup() { >> is = new int[size]; >> for (int c = 0; c < size; c++) { >> is[c] = c; >> } >> } >> >> @Benchmark >> public void test(Blackhole bh) { >> for (int i = 0; i < is.length; i++) { >> bh.consume(is[i]); >> } >> } >> } >> >> >> ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. >> >> This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. 
Now, these effects are clearly visible. >> >> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. >> >> Motivational improvements on the test above: >> >> >> Benchmark (size) Mode Cnt Score Error Units >> >> # Before, full Java blackholes >> ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op >> ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op >> ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op >> ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op >> >> # Before, compiler blackholes >> ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op >> ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op >> ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op >> ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op >> >> # After, compiler blackholes >> ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better >> ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better >> ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better >> ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better >> >> >> `-prof perfasm` shows the reason for these improvements clearly: >> >> Before: >> >> >> ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 >> 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d >> 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f >> ? 0x00007f0b5449836a: shl $0x3,%r10 >> 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" >> 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" >> 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 >> 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ >> 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 >> 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check >> 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d >> 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 >> >> >> After: >> >> >> >> ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read >> 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx >> 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx >> 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx >> 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx >> 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx >> 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 >> ? 0x00007fa06c49a8dc: cmp %esi,%r10d >> 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 >> >> >> Additional testing: >> - [x] Eyeballing JMH Samples `-prof perfasm` >> - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` >> - [x] Linux x86_64 fastdebug, JDK benchmark corpus > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Blackhole as CFG node > - Merge branch 'master' into JDK-8296545-blackhole-effects > - Blackhole should be AliasIdxTop > - Do not touch memory at all > - Fix Looks good for me with one comment. src/hotspot/share/opto/cfgnode.hpp line 610: > 608: // Blackhole all arguments. This node would survive through the compiler > 609: // the effects on its arguments, and would be finally matched to nothing. 
> 610: class BlackholeNode : public MultiNode { Also update comment at [cfgnode.hpp#L43](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.hpp#L43) ------------- PR: https://git.openjdk.org/jdk/pull/11041 From duke at openjdk.org Mon Nov 14 21:29:08 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Mon, 14 Nov 2022 21:29:08 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 17:58:36 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Vladimir's review > - live review with Sandhya > - jcheck > - Sandhya's review > - fix windows and 32b linux builds > - add getLimbs to interface and reviews > - fix 32-bit build > - make UsePolyIntrinsics option diagnostic > - Merge remote-tracking branch 'origin/master' into avx512-poly > - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db (Build finally passing!) Hi @TobiHartmann you had mentioned there were some more tests to run? Looking to see what else needs fixing. Thanks. @iwanowww thanks for the reviews! As you have time, let me know what else you see or if its good for approval? Don't want to switch too much to another intrinsic yet, one crypto algorithm is about what I can fit into my brain at a time. 
------------- PR: https://git.openjdk.org/jdk/pull/10582 From kvn at openjdk.org Mon Nov 14 22:12:08 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Nov 2022 22:12:08 GMT Subject: RFR: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 [v3] In-Reply-To: <2_USY981vz6kpzHryQDRTV02yC2ylucq74lW4GYL9E4=.9bdd7ade-2837-423e-9125-bffb41df6fc3@github.com> References: <2_USY981vz6kpzHryQDRTV02yC2ylucq74lW4GYL9E4=.9bdd7ade-2837-423e-9125-bffb41df6fc3@github.com> Message-ID: On Mon, 14 Nov 2022 16:44:27 GMT, Dmitry Samersoff wrote: >> In the void NativeJump::patch_verified_entry() we atomically patch first 4 bytes, then atomically patch 5th byte, then atomically patch first 4 bytes again. But from CMC (cross-modified code) point of view it's better to patch atomically 8 bytes at once. >> >> The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. > > Dmitry Samersoff has updated the pull request incrementally with one additional commit since the last revision: > > 8294947: Use 64bit atomics in patch_verified_entry on x86_64 My testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11059 From kvn at openjdk.org Mon Nov 14 22:16:03 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Nov 2022 22:16:03 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong [v3] In-Reply-To: References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: On Fri, 11 Nov 2022 17:33:02 GMT, Christian Hagedorn wrote: >> We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): >> >> assert(real_LCA != NULL, "must always find an LCA" >> ``` >> The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: >> - I limited the node dump to idx + node name to reduce the noise which made it hard to read. >> - Reversed the idom chain dumps to reflect the graph structure. >> >> Example output: >> >> Bad graph detected in build_loop_late >> n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) >> >> [... same output as before ...] >> >> idoms of early "197 IfFalse": >> idom[2]: 42 If >> idom[1]: 44 IfTrue >> idom[0]: 196 If >> n: 197 IfFalse >> >> idoms of (wrong) LCA "205 IfTrue": >> idom[4]: 42 If >> idom[3]: 37 Region >> idom[2]: 73 If >> idom[1]: 83 IfTrue >> idom[0]: 204 If >> n: 205 IfTrue >> >> Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): >> 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) >> >> Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Change algorithm as suggested by Roberto Update looks good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11015 From kvn at openjdk.org Mon Nov 14 22:50:55 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Nov 2022 22:50:55 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 02:36:13 GMT, Fei Gao wrote: > For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. > > We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether > `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 I have request since you are touching this code. [8192846](https://bugs.openjdk.org/browse/JDK-8192846) changes were a little sloppy and did not rename some methods which causing confusion. Please, rename `is_CmpD_candidate`, `merge_packs_to_cmpd` and others you find to general `fp` as you did with `is_cmove_fo_opcode`. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11034 From kvn at openjdk.org Mon Nov 14 23:05:38 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Nov 2022 23:05:38 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 02:36:13 GMT, Fei Gao wrote: > For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. 
> > Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. > > We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether > `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 Or may be simply remove `D`: `is_Cmp_candidate(), merge_packs_to_cmove(), test_cmp_pack()`. There are also comments which describes only `CMoveD`. Do we have IR tests to verify cmove vectorization? ------------- PR: https://git.openjdk.org/jdk/pull/11034 From tsteele at openjdk.org Mon Nov 14 23:11:56 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Mon, 14 Nov 2022 23:11:56 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: References: Message-ID: <5czeTlp7rCxjCT0RwWiyLCwqj17Zl_zt2WYJU9gscIs=.90952c18-8013-4bf5-b9be-c689beb24280@github.com> On Sun, 6 Nov 2022 17:28:53 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). >> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. >> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. 
Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Fix cpp condition and add PPC64 > - Changes lost in merge > - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 > - Use callers_sp for fsize calculation in recurse_freeze_interpreted_frame > - Loom ppc64le port I see a couple failures on Linux/ppc64le when testing with `-XX:+VerifyContinuations`. jdk/internal/vm/Continuation/Fuzz.java#default jdk/internal/vm/Continuation/Fuzz.java#preserve-fp It may be reasonable to add these to a ProblemList.txt and address them at a different time. ------------- PR: https://git.openjdk.org/jdk/pull/10961 From kvn at openjdk.org Mon Nov 14 23:21:14 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Nov 2022 23:21:14 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: <4VgZ82kW_Fc5dwN2IimRW7StzUF8tWaJjDq4hRrhUoI=.e943ec6d-c3db-4655-8744-39c858767b45@github.com> On Wed, 9 Nov 2022 07:57:30 GMT, Yi-Fan Tsai wrote: > The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. > > Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. 
> > No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. > > Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. Yes, please, post performance data. Note, TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are regression/correctness tests. Would be nice to have proper JMH benchmarks to show improvement. @sviswa7 or @jatin-bhateja do you agree with these changes? ------------- PR: https://git.openjdk.org/jdk/pull/11054 From vlivanov at openjdk.org Mon Nov 14 23:35:59 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 14 Nov 2022 23:35:59 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v3] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Mon, 14 Nov 2022 19:47:29 GMT, Aleksey Shipilev wrote: >> If you look at generated code for the JMH benchmark like: >> >> >> public class ArrayRead { >> @Param({"1", "100", "10000", "1000000"}) >> int size; >> >> int[] is; >> >> @Setup >> public void setup() { >> is = new int[size]; >> for (int c = 0; c < size; c++) { >> is[c] = c; >> } >> } >> >> @Benchmark >> public void test(Blackhole bh) { >> for (int i = 0; i < is.length; i++) { >> bh.consume(is[i]); >> } >> } >> } >> >> >> ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. >> >> This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. >> >> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. >> >> Motivational improvements on the test above: >> >> >> Benchmark (size) Mode Cnt Score Error Units >> >> # Before, full Java blackholes >> ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op >> ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op >> ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op >> ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op >> >> # Before, compiler blackholes >> ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op >> ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op >> ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op >> ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op >> >> # After, compiler blackholes >> ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better >> ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better >> ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better >> ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better >> >> >> `-prof perfasm` shows the reason for these improvements clearly: >> >> Before: >> >> >> ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 >> 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d >> 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f >> ? 0x00007f0b5449836a: shl $0x3,%r10 >> 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" >> 7.76% ? 
0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" >> 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 >> 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ >> 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 >> 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check >> 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d >> 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 >> >> >> After: >> >> >> >> ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read >> 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx >> 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx >> 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx >> 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx >> 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx >> 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 >> ? 0x00007fa06c49a8dc: cmp %esi,%r10d >> 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 >> >> >> Additional testing: >> - [x] Eyeballing JMH Samples `-prof perfasm` >> - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` >> - [x] Linux x86_64 fastdebug, JDK benchmark corpus > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Blackhole as CFG node > - Merge branch 'master' into JDK-8296545-blackhole-effects > - Blackhole should be AliasIdxTop > - Do not touch memory at all > - Fix Looks good. (hs-tier1 - hs-tier2 testing passed.) ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/11041 From vlivanov at openjdk.org Tue Nov 15 00:23:53 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 15 Nov 2022 00:23:53 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: Message-ID: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> On Mon, 14 Nov 2022 17:58:36 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 
154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Vladimir's review > - live review with Sandhya > - jcheck > - Sandhya's review > - fix windows and 32b linux builds > - add getLimbs to interface and reviews > - fix 32-bit build > - make UsePolyIntrinsics option diagnostic > - Merge remote-tracking branch 'origin/master' into avx512-poly > - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 103: > 101: > 102: ATTRIBUTE_ALIGNED(64) uint64_t POLY1305_MASK44[] = { > 103: // OFFSET 64: mask_44 Redundant comment. src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 384: > 382: void StubGenerator::poly1305_limbs(const Register limbs, const Register a0, const Register a1, const Register a2, bool only128) > 383: { > 384: const Register t1 = r13; Please, make the temps explicit and lift them into arguments. Otherwise, it's hard to see what registers are clobbered when helper methods are called. src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 387: > 385: const Register t2 = r14; > 386: > 387: __ movq(a0, Address(limbs, 0)); I don't understand how it works. `limbs` comes directly from `c_rarg2` and contains raw oop. So, `Address(limbs, 0)` reads object mark word rather than the first element from the array. (Same situation in `poly1305_limbs_out`. And now I'm curious why doesn't object header corruption trigger a crash.) src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 987: > 985: > 986: // Load R into r1:r0 > 987: poly1305_limbs(R, r0, r1, r1, true); What's the intention here when you pass `r1` twice? Just load `R[0]` and `R[2]`. You could use `noreg` to mark an optional operation and check for it in `poly1305_limbs` before loading the corresponding element. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From sviswanathan at openjdk.org Tue Nov 15 00:28:06 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 15 Nov 2022 00:28:06 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> Message-ID: <-JVYIHKOY_LuVTqyH5xuubtPdk8pK_wi5z-8pestRis=.e63938ab-0ac2-4880-8238-e6e6d8debf03@github.com> On Tue, 15 Nov 2022 00:10:35 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: >> >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Vladimir's review >> - live review with Sandhya >> - jcheck >> - Sandhya's review >> - fix windows and 32b linux builds >> - add getLimbs to interface and reviews >> - fix 32-bit build >> - make UsePolyIntrinsics option diagnostic >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - ... 
and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 387: > >> 385: const Register t2 = r14; >> 386: >> 387: __ movq(a0, Address(limbs, 0)); > > I don't understand how it works. `limbs` comes directly from `c_rarg2` and contains raw oop. So, `Address(limbs, 0)` reads object mark word rather than the first element from the array. > > (Same situation in `poly1305_limbs_out`. And now I'm curious why doesn't object header corruption trigger a crash.) library_call.cpp takes care of that, it passes the address of 0'th element to the stub. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Tue Nov 15 00:51:07 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 15 Nov 2022 00:51:07 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: <-JVYIHKOY_LuVTqyH5xuubtPdk8pK_wi5z-8pestRis=.e63938ab-0ac2-4880-8238-e6e6d8debf03@github.com> References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> <-JVYIHKOY_LuVTqyH5xuubtPdk8pK_wi5z-8pestRis=.e63938ab-0ac2-4880-8238-e6e6d8debf03@github.com> Message-ID: On Tue, 15 Nov 2022 00:25:46 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 387: >> >>> 385: const Register t2 = r14; >>> 386: >>> 387: __ movq(a0, Address(limbs, 0)); >> >> I don't understand how it works. `limbs` comes directly from `c_rarg2` and contains raw oop. So, `Address(limbs, 0)` reads object mark word rather than the first element from the array. >> >> (Same situation in `poly1305_limbs_out`. And now I'm curious why doesn't object header corruption trigger a crash.) > > library_call.cpp takes care of that, it passes the address of 0'th element to the stub. Ah, got it. Worth elaborating that in the comments. Otherwise, they confuse rather than help: // void processBlocks(byte[] input, int len, int[5] a, int[5] r) const Register input = rdi; //input+offset const Register length = rbx; const Register accumulator = rcx; const Register R = r8; ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Tue Nov 15 00:51:09 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 15 Nov 2022 00:51:09 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 17:58:36 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 
390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Vladimir's review > - live review with Sandhya > - jcheck > - Sandhya's review > - fix windows and 32b linux builds > - add getLimbs to interface and reviews > - fix 32-bit build > - make UsePolyIntrinsics option diagnostic > - Merge remote-tracking branch 'origin/master' into avx512-poly > - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db src/hotspot/share/opto/library_call.cpp line 6976: > 6974: > 6975: if (!stubAddr) return false; > 6976: Node* input = argument(1); Receiver null check is missing. Since the method being intrinsified is non-static, the intrinsic itself has to take care of receiver null check. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From xlinzheng at openjdk.org Tue Nov 15 04:11:34 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Tue, 15 Nov 2022 04:11:34 GMT Subject: RFR: 8296975: RISC-V: Enable UseRVA20U64 profile by default Message-ID: The main purpose is to turn the option `UseRVC` on by default before JDK20 RDP 1. As per discussions [1], we can enable UseRVA20U64[2] by default to fulfill this. [1] https://mail.openjdk.org/pipermail/riscv-port-dev/2022-November/000668.html [2] https://github.com/openjdk/jdk/blob/873eccde01895de06e2216f6838d52d07188addd/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L39-L44 Thanks, Xiaolin ------------- Commit messages: - Merge remote-tracking branch 'github-openjdk/master' into rvc-by-default - Enable UseRVA20U64 by default and move my home - Revert "RVC by default" - Revert "Move my home" - Move my home - RVC by default Changes: https://git.openjdk.org/jdk/pull/11155/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11155&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296975 Stats: 3 lines in 1 file changed: 1 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11155.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11155/head:pull/11155 PR: https://git.openjdk.org/jdk/pull/11155 From bulasevich at openjdk.org Tue Nov 15 04:11:39 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Nov 2022 04:11:39 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v10] In-Reply-To: References: Message-ID: <0cY55nS7G8QvH_FRQzNLgRPXZ2iYJxodaA9xx1Ab7GA=.f6429844-7df4-4f78-b649-5eb8669abb01@github.com> > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. > > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 14 additional commits since the last revision: - warning fix - add test for buffer grow - adding jtreg test for CompressedSparseDataReadStream impl - align java impl to cpp impl - rewrite the SparseDataWriteStream not to use _curr_byte - introduce and call flush() excplicitly, add the gtest - minor renaming. adding encoding examples table - cleanup and rename - cleanup - rewrite code without virtual functions - ... and 4 more: https://git.openjdk.org/jdk/compare/256958bf...3ceefe68 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/e5f03dda..3ceefe68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=08-09 Stats: 64600 lines in 1035 files changed: 21920 ins; 38055 del; 4625 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From dongbo at openjdk.org Tue Nov 15 06:41:09 2022 From: dongbo at openjdk.org (Dong Bo) Date: Tue, 15 Nov 2022 06:41:09 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: References: <2_tbWMZjjQIyN_HVI_KUY9znox8zGh1vSEtj56vvZog=.0b9c6be0-cae9-4f06-99d9-6c4058e273c3@github.com> Message-ID: On Mon, 14 Nov 2022 17:53:46 GMT, Andrew Haley wrote: > > > This looks right, but I don't think I can test it, which I usually would do with a patch this complicated. When we have a processor without FEAT_SHA3) we should define BCAX, EOR3, RAX1, and XAR as macros. Could you do that, please? > > > > > > Thanks for the comments. > > Do you mean that we need a patch (not to be merged) to support testing on processor without FEAT_SHA3? In which, the SHA3 instructions are substituted by multiple instructions, something like: `eor3 v1, v2, v3, v4 => eor v1, v2, v3; eor v1, v1, v4`? > > Yes, exactly. So the intrinsic will work everywhere. Why not merge it? It seems obvious to me. > I see, great idea. But I'm afraid this cannot easily be done due to the high register pressure of the `keccak()` algorithm loop. Three registers are passed to `xar Vd, Vn, Vm, #imm`, which operation equals to `tmp = Vn ROR Vm; Vd = ROR(tmp[127:64], imm):ROR(tmp[63:0], imm)`. Registers `Vm/Vn` will be read by operations below, we need another `tmp` register, or we will get clobbered `Vm/Vn` when translating `xar` to NEON instructions. And we cannot find a free `tmp` register, almost every register has useful data for future use. The other three SHA3 instructions have similar register issue. > > BTW, FEAT_SHA3 is supported on M1. If you happen to have one, the test can be done on it. :) To test this on M1/MacOS, modification below is needed to enable SHA3Intriniscs by default. Since other features, i.e. UseSHA, can not be automatically detected neither, I think it is irrelevant with this patch. > > Do you know why SHA3 isn't enabled on M1? FEAT_SHA3 is not detected on MacOS, this code snippet below would be enough for FEAT_SHA3. I think there are some other hardware features, e.g. FEAT_SHA1, FEAT_SHA256, FEAT_SHA512, are not detected on MacOS too. We may need another PR to fix it. 
diff --git a/src/hotspot/os_cpu/bsd_aarch64/vm_version_bsd_aarch64.cpp b/src/hotspot/os_cpu/bsd_aarch64/vm_version_bsd_aarch64.cpp index 45cd77b3ba5..0ac6b4a56d1 100644 --- a/src/hotspot/os_cpu/bsd_aarch64/vm_version_bsd_aarch64.cpp +++ b/src/hotspot/os_cpu/bsd_aarch64/vm_version_bsd_aarch64.cpp @@ -68,6 +68,8 @@ void VM_Version::get_os_cpu_info() { if (cpu_has("hw.optional.armv8_crc32")) _features |= CPU_CRC32; if (cpu_has("hw.optional.armv8_1_atomics")) _features |= CPU_LSE; + if (cpu_has("hw.optional.armv8_2_sha3")) _features |= CPU_SHA3; + int cache_line_size; int hw_conf_cache_line[] = { CTL_HW, HW_CACHELINE }; sysctllen = sizeof(cache_line_size); ------------- PR: https://git.openjdk.org/jdk/pull/10939 From thartmann at openjdk.org Tue Nov 15 06:55:53 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Nov 2022 06:55:53 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v3] In-Reply-To: <9XWZNcNcmELCLXDwpuNgpztPrw8xXajJQcj_daf4jhU=.4af44336-021f-4688-9a56-6a90c8e12f53@github.com> References: <9XWZNcNcmELCLXDwpuNgpztPrw8xXajJQcj_daf4jhU=.4af44336-021f-4688-9a56-6a90c8e12f53@github.com> Message-ID: On Mon, 24 Oct 2022 09:02:58 GMT, Tobias Hartmann wrote: >> Volodymyr Paprotski has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> further restrict UsePolyIntrinsics with supports_avx512vlbw > > Thanks, I'll re-run testing. > Hi @TobiHartmann you had mentioned there were some more tests to run? Looking to see what else needs fixing. Thanks. Sure, I re-submitted testing. EDIT: I see that Vladimir already did that. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From bulasevich at openjdk.org Tue Nov 15 07:05:46 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Nov 2022 07:05:46 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v11] In-Reply-To: References: Message-ID: <2D9ynUtu7IxcnyELEChKZf0zpksKpmAWZorKxVJlm40=.c9b41147-c5cf-48dd-a6af-d9c30d2705d6@github.com> > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. 
> > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: cleanup, rename ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/3ceefe68..1135bac4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=09-10 Stats: 51 lines in 3 files changed: 6 ins; 12 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Tue Nov 15 07:05:47 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Nov 2022 07:05:47 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section In-Reply-To: <7xYMYWazVI1VHQK0g9GafFqeH9kA-EWZfsOHhy4cIXs=.8ad5ab4f-6e3a-4b25-92a7-c027a390e018@github.com> References: <7xYMYWazVI1VHQK0g9GafFqeH9kA-EWZfsOHhy4cIXs=.8ad5ab4f-6e3a-4b25-92a7-c027a390e018@github.com> Message-ID: <5vzYee-gnYizXbBtZAvwkEUQBoUpBwM2iTIMPUsiBY0=.d3ca9e61-29be-4c3a-a2b2-8d11ef7f64ed@github.com> On Tue, 4 Oct 2022 12:58:45 GMT, Boris Ulasevich wrote: >>> > What is the performance impact of making several of the methods virtual? >>> >>> Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced by 424->113 MB/s with virtual functions. Compared to compile time, this is miserable: ?ompilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual function (worst case) takes 3ms. And I do not see any regression with benchmarks. >> >> I was wondering more about read performance. I would expect that the debuginfo could be read many more times than it is written. Also, from 424 to 113 seems like a very large slowdown. > >> > > What is the performance impact of making several of the methods virtual? >> > >> > >> > Good question! My experiments show that in the worst case, the performance of the debug write thread is reduced by 424->113 MB/s with virtual functions. Compared to compile time, this is miserable: ?ompilation takes 1000ms per method, while generation of 300 bytes of scopes data with virtual function (worst case) takes 3ms. And I do not see any regression with benchmarks. >> >> I was wondering more about read performance. I would expect that the debuginfo could be read many more times than it is written. Also, from 424 to 113 seems like a very large slowdown. > > Right. With counters in virtual methods, I see that reading debug information is less frequent than writing. Anyway. Let me rewrite code without virtual functions. > @bulasevich, Could you please add gtest unit tests checking `CompressedSparseDataWriteStream`/`CompressedSparseDataReadStream`? Yes. Thanks ------------- PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Tue Nov 15 07:06:01 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Nov 2022 07:06:01 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v9] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 17:29:20 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has refreshed the contents of this pull request, and previous commits have been removed. Incremental views are not available. 
> > src/hotspot/share/code/compressedStream.cpp line 192: > >> 190: if (_position >= _size) { >> 191: grow(); >> 192: } > > Now we have these checks spread across the code. > There are two actions changing `_postion`: > - `_position++` > - `set_position` > > We can replace `_position++` with `inc_position` where we can have the check with `grow`. > Regarding `set_position `, I have looked at its current uses. > Its uses are to support shared debug info: > - We write info. > - We check if we have written the same info. > - If yes, we use the one written before and roll back position. > > If I haven't missed other uses, the meaning of `set_position` is to roll back. In such case, no `grow` is needed. I suggest to rename `set_position` to `roll_back_to` or `move_back_to`. (1) we have a few _position increments (2) we have a few _buffer[_position] access places The check `if (_position >= _size) { grow(); }` can go either to either after (1) of before (2). I prefer the latter because we do not need to extend a buffer if we do not write there. > src/hotspot/share/code/compressedStream.hpp line 184: > >> 182: } >> 183: >> 184: void flush() { > > Why do we need `flush` if we modify data in place? now it is align() ------------- PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Tue Nov 15 07:06:03 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Nov 2022 07:06:03 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v7] In-Reply-To: References: Message-ID: <6RdkbyvuuKIY135ZRkp_8tlwlfAdFQ6sMC44rPqW0lA=.b4890fc4-affc-4f7e-ac34-d5c95567ba87@github.com> On Mon, 14 Nov 2022 13:45:56 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: >> >> - adding jtreg test for CompressedSparseDataReadStream impl >> - align java impl to cpp impl >> - rewrite the SparseDataWriteStream not to use _curr_byte >> - introduce and call flush() excplicitly, add the gtest >> - minor renaming. adding encoding examples table >> - cleanup and rename >> - cleanup >> - rewrite code without virtual functions >> - warning fix and name fix >> - optimize the encoding >> - ... and 2 more: https://git.openjdk.org/jdk/compare/fd668dc4...637c94be > > src/hotspot/share/code/compressedStream.cpp line 225: > >> 223: memcpy(_new_buffer, _buffer, _position); >> 224: _buffer = _new_buffer; >> 225: _size = nsize; > > Signed integer overflow is UB. > The correct assert: > > assert(_size <= INT_MAX / 2, "debug data size must not exceed INT_MAX"); > int nsize = _size * 2; OK ------------- PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Tue Nov 15 07:05:58 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Nov 2022 07:05:58 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Sun, 30 Oct 2022 17:27:23 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has refreshed the contents of this pull request, and previous commits have been removed. Incremental views are not available. > > src/hotspot/share/code/compressedStream.cpp line 117: > >> 115: >> 116: >> 117: bool CompressedSparseDataReadStream::read_zero() { > > If the last value written to a stream was 0, a reader would not know this is one 0 or eight 0s. Is there a guarantee that the number of reads will the same as the number of writes? Reader knows what he reads. We have no marks in a raw data. 
> src/hotspot/share/code/compressedStream.cpp line 145: > >> 143: // >> 144: // value | byte0 | byte1 | byte2 | byte3 | byte4 >> 145: // -----------+----------+----------+----------+----------+---------- > > in each byte bit 6 indicates whether it is the last byte in the sequence such comment would be misleading > src/hotspot/share/code/compressedStream.cpp line 171: > >> 169: if (_bit_pos == 0) { >> 170: return _position; >> 171: } > > We can rewrite the function not to use `_curr_byte`. We work directly on `_buffer` and rename `_bit_pos` into `_used_bits`: > > if (_used_bits == 8) { > _buffer[++_position] = 0; > _used_bits = 1; > } else { > _buffer[_position] >>= 1; > ++_used_bits; > } thanks for the idea not to use _curr_byte _bit_pos is used in both Read and Write streams. I think proper name is _bit_position. > src/hotspot/share/code/compressedStream.cpp line 175: > >> 173: write(_curr_byte << (8 - _bit_pos)); >> 174: _curr_byte = 0; >> 175: _bit_pos = 0; > > Let's extract this and call it `flush()`. Ok. Now it is align() as _curr_byte is evicted > src/hotspot/share/code/compressedStream.cpp line 189: > >> 187: >> 188: void CompressedSparseDataWriteStream::write_byte_impl(uint8_t b) { >> 189: write((_curr_byte << (8 - _bit_pos)) | (b >> _bit_pos)); > > _buffer[_position] |= (b >> _used_bits); > _buffer[++_position] = (b << (8 - _used_bits)); OK > I don't see what causes it to be written. It was a flush side effect caused by get_position in the end of the buffer filling. Anyway, it is rewritten now. Thanks! > src/hotspot/share/code/compressedStream.hpp line 115: > >> 113: }; >> 114: >> 115: class CompressedBitStream : public ResourceObj { > > Maybe it is better `CompressedSparseData`? OK > src/hotspot/share/code/compressedStream.hpp line 185: > >> 183: int position(); // method have a side effect: the current byte becomes aligned >> 184: void set_position(int pos) { >> 185: position(); > > `position()` -> `flush()` OK > src/hotspot/share/code/compressedStream.hpp line 197: > >> 195: grow(); >> 196: } >> 197: _buffer[_position++] = b; > > I think `pos` must be `<= position()`. Should we check this? In fact, it is used as rollback. Though I do not want to limit functionality to this usage only. Let me add `assert(_position < _size, "set_position is only used for rollback");` check here > src/hotspot/share/code/debugInfo.hpp line 298: > >> 296: // debugging information. Used by ScopeDesc. >> 297: >> 298: class DebugInfoReadStream : public CompressedSparseDataReadStream { > > I don't think `DebugInfoReadStream`/`DebugInfoWriteStream` need public inheritance. The relation is more like composition. > I would have implemented them like: > > class DebugInfoReadStream : private CompressedSparseDataReadStream { > public: > // we are using only needed functions from CompressedSparseDataReadStream. > using CompressedSparseDataReadStream::buffer(); > using CompressedSparseDataReadStream::read_int(); > using ... > }; > > Or > > template class DebugInfoReadStream { > public: > // define only needed functions which use a minimum number of functions from DataReadStream > }; > > > I prefer the templates because we can easily switch between different implementations of `DataReadStream`/DataWriteStream` without doing this kind of modifications. @No templates please! :) For me, the following change is counterproductive as well. 
- class DebugInfoReadStream : private CompressedSparseDataReadStream { + class DebugInfoReadStream : private CompressedSparseDataReadStream { + public: + using CompressedSparseDataReadStream::read_int; + using CompressedSparseDataReadStream::read_signed_int; + using CompressedSparseDataReadStream::read_double; + using CompressedSparseDataReadStream::read_long; + using CompressedSparseDataReadStream::read_bool; + using CompressedSparseDataReadStream::CompressedSparseDataReadStream; > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CompressedSparseDataReadStream.java line 28: > >> 26: import sun.jvm.hotspot.debugger.*; >> 27: >> 28: public class CompressedSparseDataReadStream extends CompressedReadStream { > > This needs to be aligned with C++ code. > Can we test the code? I have added a simple jtreg test. thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10025 From rrich at openjdk.org Tue Nov 15 08:39:14 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 15 Nov 2022 08:39:14 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: <5czeTlp7rCxjCT0RwWiyLCwqj17Zl_zt2WYJU9gscIs=.90952c18-8013-4bf5-b9be-c689beb24280@github.com> References: <5czeTlp7rCxjCT0RwWiyLCwqj17Zl_zt2WYJU9gscIs=.90952c18-8013-4bf5-b9be-c689beb24280@github.com> Message-ID: On Mon, 14 Nov 2022 23:08:09 GMT, Tyler Steele wrote: > I see a couple failures on Linux/ppc64le when testing with `-XX:+VerifyContinuations`. > > jdk/internal/vm/Continuation/Fuzz.java#default jdk/internal/vm/Continuation/Fuzz.java#preserve-fp > > It may be reasonable to add these to a ProblemList.txt and address them at a different time. Thanks for looking at the pr. The failures are due to timeouts. Depending on the load on your test system it might be necessary to increment the timeout factor (JTREG="TIMEOUT_FACTOR=8"). There were issues with the test before (see https://bugs.openjdk.org/browse/JDK-8290211). ------------- PR: https://git.openjdk.org/jdk/pull/10961 From roland at openjdk.org Tue Nov 15 08:58:14 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 15 Nov 2022 08:58:14 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v5] In-Reply-To: References: Message-ID: > This change is mostly the same I sent for review 3 years ago but was > never integrated: > > https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html > > The main difference is that, in the meantime, I submitted a couple of > refactoring changes extracted from the 2019 patch: > > 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr > 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses > > As a result, the current patch is much smaller (but still not small). > > The implementation is otherwise largely the same as in the 2019 > patch. I tried to remove some of the code duplication between the > TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic > shared in template methods. In the 2019 patch, interfaces were trusted > when types were constructed and I had added code to drop interfaces > from a type where they couldn't be trusted. This new patch proceeds > the other way around: interfaces are not trusted when a type is > constructed and code that uses the type must explicitly request that > they are included (this was suggested as an improvement by Vladimir > Ivanov I think). 
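As a rough illustration of that last point (a made-up toy type, not the actual C2 TypeOopPtr/TypeKlassPtr API), the shape of the opt-in approach is: a type can carry an interface set, but ordinary queries ignore it and callers must explicitly ask for the interface view:

// Toy sketch only -- names are invented; this is not the C2 Type hierarchy.
#include <cassert>
#include <set>
#include <string>
#include <utility>

class ToyInstanceType {
  std::string _klass;
  std::set<std::string> _interfaces;  // known from the class file, but not trusted by default
public:
  ToyInstanceType(std::string k, std::set<std::string> ifs)
    : _klass(std::move(k)), _interfaces(std::move(ifs)) {}

  const std::string& klass() const { return _klass; }

  // Default query: the interface set is deliberately not consulted.
  bool implements(const std::string& iface) const {
    (void)iface;
    return false;
  }

  // Callers that have verified the interfaces must request them explicitly.
  bool implements_trusting_interfaces(const std::string& iface) const {
    return _interfaces.count(iface) != 0;
  }
};

int main() {
  ToyInstanceType t("SomeClass", {"java/lang/Runnable"});
  assert(!t.implements("java/lang/Runnable"));                     // interfaces ignored by default
  assert(t.implements_trusting_interfaces("java/lang/Runnable"));  // explicit opt-in
  return 0;
}

Presumably the advantage of this direction is that forgetting to opt in merely loses an optimization, instead of optimizing on unverified interface information.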
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10901/files - new: https://git.openjdk.org/jdk/pull/10901/files/f49a042a..ff151ff3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10901&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10901&range=03-04 Stats: 22 lines in 4 files changed: 3 ins; 14 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10901.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10901/head:pull/10901 PR: https://git.openjdk.org/jdk/pull/10901 From roland at openjdk.org Tue Nov 15 08:58:15 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 15 Nov 2022 08:58:15 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v4] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 01:03:41 GMT, Vladimir Ivanov wrote: > Great work, Roland! I'm approving the PR. (hs-tier1 - hs-tier2 sanity testing passed with latest version.) > > Feel free to handle `ciArrayKlass::interfaces()` as you find most appropriate. Thanks for the review (and running tests). I removed `ciArrayKlass::interfaces()` in the new commit. Is the resulting code what you suggested? ------------- PR: https://git.openjdk.org/jdk/pull/10901 From rcastanedalo at openjdk.org Tue Nov 15 09:36:59 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 15 Nov 2022 09:36:59 GMT Subject: RFR: JDK-8295934: IGV: keep node selection when changing view or graph [v4] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 16:11:45 GMT, Tobias Holenstein wrote: >> In IGV nodes can be selected by clicking on it. When a user selects nodes in a certain view, e.g. "Cluster nodes into blocks" view, and then change to e.g. "Sea of nodes" view, the selection should be kept. Same when the user goes a different graph in the same group, the selection should be kept (as long as the nodes are still present) >> >> New selection features: >> - When opening a new graph and no nodes where selected previously the root nodes is centered. >> - When a graph in the same group is opened, the previously selected nodes as selected as well if they are present in the graph. The selected nodes are centered in the new graph. >> - The selected nodes are kept when changing the view, or the properties of the view (e.g. "show neighboring nodes semi-transparent") >> cluster_view >> desired >> >> - When "show neighboring nodes semi-transparent" is disabled, previously semi-transparent nodes that were selected are now unselected (because they are not visible anymore) >> - It would also be desired adjust the scroll pane to center the selected nodes when changing view, graph, etc. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > revert unrelated changes Looks good now, thanks for addressing my feedback and for the explanation about `LineWidget.java`. ------------- Marked as reviewed by rcastanedalo (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11062 From aph at openjdk.org Tue Nov 15 10:03:07 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Nov 2022 10:03:07 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 03:06:21 GMT, Dong Bo wrote: > In JDK-8252204, when implemented SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`. > However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`: > > digest_length block_size > SHA3-224 28 144 > SHA3-256 32 136 > SHA3-384 48 104 > SHA3-512 64 72 > SHAKE128 variable 168 > SHAKE256 variable 136 > > > This causes SIGSEGV crash or hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`. > > The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances. > Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed. > And tier1~3 passed on SHA3 supported hardware. > > The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform. > The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively. > > Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform. > Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression. > These performance details are available in the comments of the issue page. > I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform. Marked as reviewed by aph (Reviewer). Hmm, okay. Looks like there's work to do on this. I'll approve this patch, but we really must get MacOS fixed for JDK 20. ------------- PR: https://git.openjdk.org/jdk/pull/10939 From tholenstein at openjdk.org Tue Nov 15 10:16:19 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 15 Nov 2022 10:16:19 GMT Subject: RFR: JDK-8295934: IGV: keep node selection when changing view or graph [v4] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 09:10:16 GMT, Tobias Hartmann wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> revert unrelated changes > > Looks good to me functionality-wise. Thanks @TobiHartmann and @robcasloz for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/11062 From tholenstein at openjdk.org Tue Nov 15 10:16:19 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 15 Nov 2022 10:16:19 GMT Subject: Integrated: JDK-8295934: IGV: keep node selection when changing view or graph In-Reply-To: References: Message-ID: <64rIm2yxyUAWvQMLmwNwXXLDVX8TWgsWc0C84RGeed0=.558ecbf8-aea8-4be1-a426-87f21d0e50b3@github.com> On Wed, 9 Nov 2022 14:30:48 GMT, Tobias Holenstein wrote: > In IGV nodes can be selected by clicking on it. When a user selects nodes in a certain view, e.g. "Cluster nodes into blocks" view, and then change to e.g. "Sea of nodes" view, the selection should be kept. 
The same applies when the user goes to a different graph in the same group: the selection should be kept (as long as the nodes are still present) > > New selection features: > - When opening a new graph and no nodes were selected previously, the root node is centered. > - When a graph in the same group is opened, the previously selected nodes are selected as well if they are present in the graph. The selected nodes are centered in the new graph. > - The selected nodes are kept when changing the view, or the properties of the view (e.g. "show neighboring nodes semi-transparent") > cluster_view > desired > > - When "show neighboring nodes semi-transparent" is disabled, previously semi-transparent nodes that were selected are now unselected (because they are not visible anymore) > - It would also be desirable to adjust the scroll pane to center the selected nodes when changing the view, graph, etc. This pull request has now been integrated. Changeset: 6f467cd8 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/6f467cd8292d41afa57c183879a704c987515243 Stats: 706 lines in 17 files changed: 321 ins; 197 del; 188 mod 8295934: IGV: keep node selection when changing view or graph Reviewed-by: thartmann, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/11062 From chagedorn at openjdk.org Tue Nov 15 10:19:55 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 15 Nov 2022 10:19:55 GMT Subject: RFR: 8295952: Problemlist existing compiler/rtm tests also on x86 In-Reply-To: References: Message-ID: On Wed, 26 Oct 2022 16:43:26 GMT, zzambers wrote: > The problem list should be extended so that existing compiler/rtm entries include x86 (32-bit) Intel builds as well, as these are also affected. Looks good and trivial! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10875 From dsamersoff at openjdk.org Tue Nov 15 10:47:11 2022 From: dsamersoff at openjdk.org (Dmitry Samersoff) Date: Tue, 15 Nov 2022 10:47:11 GMT Subject: Integrated: JDK-8294947: Use 64bit atomics in patch_verified_entry on x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 12:41:59 GMT, Dmitry Samersoff wrote: > In NativeJump::patch_verified_entry() we atomically patch the first 4 bytes, then atomically patch the 5th byte, then atomically patch the first 4 bytes again. But from the CMC (cross-modified code) point of view it is better to atomically patch all 8 bytes at once. > > The patch was tested with hotspot jtreg tests in bare-metal and virtualized environments. This pull request has now been integrated. Changeset: d0fae43e Author: Dmitry Samersoff URL: https://git.openjdk.org/jdk/commit/d0fae43e89a73e9d73b074fa12276c43ba629278 Stats: 22 lines in 1 file changed: 19 ins; 3 del; 0 mod 8294947: Use 64bit atomics in patch_verified_entry on x86_64 Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/11059 From roland at openjdk.org Tue Nov 15 11:47:56 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 15 Nov 2022 11:47:56 GMT Subject: RFR: 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" Message-ID: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> This failure is similar to previous failures with loop strip mining: a node is encountered that has control set in the outer strip mined loop but is not reachable from the safepoint.
There's already logic in loop cloning to find those and fix their control to be outside the loop. Usually a node ends up in the outer loop because some of its inputs is in the outer loop. The current logic to catch nodes that are erroneously assigned control in the outer loop is to start from safepoint's inputs and look for uses with incorrect control. That doesn't work in this case because: 1) the node is created by IdealLoopTree::reassociate in the outer loop because its inputs are indeed there 2) but a pass of split if updates the control to be inside the inner loop. To fix this, I propose reusing the existing clone_outer_loop_helper() but apply it to the loop body as well. I had to tweak that method because I ran into cases of dead nodes still reachable from a node in the loop body but removed from the _body list by IdealLoopTree::DCE_loop_body() (and as a result not cloned). ------------- Commit messages: - test - fix Changes: https://git.openjdk.org/jdk/pull/11162/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11162&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295788 Stats: 79 lines in 2 files changed: 68 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/11162.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11162/head:pull/11162 PR: https://git.openjdk.org/jdk/pull/11162 From jbhateja at openjdk.org Tue Nov 15 12:26:59 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 15 Nov 2022 12:26:59 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: <4VgZ82kW_Fc5dwN2IimRW7StzUF8tWaJjDq4hRrhUoI=.e943ec6d-c3db-4655-8744-39c858767b45@github.com> References: <4VgZ82kW_Fc5dwN2IimRW7StzUF8tWaJjDq4hRrhUoI=.e943ec6d-c3db-4655-8744-39c858767b45@github.com> Message-ID: On Mon, 14 Nov 2022 23:16:42 GMT, Vladimir Kozlov wrote: > @sviswa7 or @jatin-bhateja do you agree with these changes? Patch shows significant improvement and better port utilization with 3+ micro ops on CLX. JDK-With-opt: Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units MessageDigests.digest md5 64 DEFAULT thrpt 2 5613.517 ops/ms MessageDigests.digest md5 16384 DEFAULT thrpt 2 50.026 ops/ms 43,24,11,23,563 exe_activity.1_ports_util (79.97%) 54,01,28,04,330 exe_activity.2_ports_util (80.22%) 25,20,63,64,512 exe_activity.3_ports_util (80.00%) 6,42,47,64,948 exe_activity.4_ports_util (79.83%) JDK-baseline: Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units MessageDigests.digest md5 64 DEFAULT thrpt 2 4087.112 ops/ms MessageDigests.digest md5 16384 DEFAULT thrpt 2 35.291 ops/ms 50,76,35,89,853 exe_activity.1_ports_util (80.09%) 36,59,68,98,931 exe_activity.2_ports_util (79.89%) 9,61,69,23,581 exe_activity.3_ports_util (80.02%) 1,88,94,94,202 exe_activity.4_ports_util (79.98%) ------------- PR: https://git.openjdk.org/jdk/pull/11054 From duke at openjdk.org Tue Nov 15 12:56:00 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Tue, 15 Nov 2022 12:56:00 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 07:57:30 GMT, Yi-Fan Tsai wrote: > The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. > > Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. 
> > No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. > > Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. Performance without the optimization on Cascade Lake: Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units MessageDigests.digest md5 64 DEFAULT thrpt 15 3315.328 ? 65.799 ops/ms MessageDigests.digest md5 16384 DEFAULT thrpt 15 27.482 ? 0.006 ops/ms MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 2916.207 ? 127.293 ops/ms MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 27.381 ? 0.003 ops/ms Performance with optimization on Cascade Lake: Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units MessageDigests.digest md5 64 DEFAULT thrpt 15 4474.780 ? 17.583 ops/ms MessageDigests.digest md5 16384 DEFAULT thrpt 15 38.926 ? 0.005 ops/ms MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 3796.684 ? 153.887 ops/ms MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 38.724 ? 0.005 ops/ms ------------- PR: https://git.openjdk.org/jdk/pull/11054 From shade at openjdk.org Tue Nov 15 13:43:06 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Nov 2022 13:43:06 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v4] In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. > > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. > > We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. > > Motivational improvements on the test above: > > > Benchmark (size) Mode Cnt Score Error Units > > # Before, full Java blackholes > ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op > ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op > ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op > ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op > > # Before, compiler blackholes > ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op > ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op > ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op > ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op > > # After, compiler blackholes > ArrayRead.test 1 avgt 9 1.686 ? 
0.006 ns/op ; ~1.1x better > ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better > ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better > ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better > > > `-prof perfasm` shows the reason for these improvements clearly: > > Before: > > > ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 > 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d > 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f > ? 0x00007f0b5449836a: shl $0x3,%r10 > 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" > 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" > 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 > 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ > 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 > 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check > 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d > 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 > > > After: > > > > ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read > 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 > ? 0x00007fa06c49a8dc: cmp %esi,%r10d > 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 > > > Additional testing: > - [x] Eyeballing JMH Samples `-prof perfasm` > - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` > - [x] Linux x86_64 fastdebug, JDK benchmark corpus Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Add comment in cfgnode.hpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11041/files - new: https://git.openjdk.org/jdk/pull/11041/files/66247f75..06eb3d6a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=02-03 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11041.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11041/head:pull/11041 PR: https://git.openjdk.org/jdk/pull/11041 From shade at openjdk.org Tue Nov 15 13:43:08 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Nov 2022 13:43:08 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v3] In-Reply-To: References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: On Mon, 14 Nov 2022 19:47:29 GMT, Aleksey Shipilev wrote: >> If you look at generated code for the JMH benchmark like: >> >> >> public class ArrayRead { >> @Param({"1", "100", "10000", "1000000"}) >> int size; >> >> int[] is; >> >> @Setup >> public void setup() { >> is = new int[size]; >> for (int c = 0; c < size; c++) { >> is[c] = c; >> } >> } >> >> @Benchmark >> public void test(Blackhole bh) { >> for (int i = 0; i < is.length; i++) { >> bh.consume(is[i]); >> } >> } >> } >> >> >> ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. 
>> >> This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. >> >> We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. >> >> Motivational improvements on the test above: >> >> >> Benchmark (size) Mode Cnt Score Error Units >> >> # Before, full Java blackholes >> ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op >> ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op >> ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op >> ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op >> >> # Before, compiler blackholes >> ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op >> ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op >> ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op >> ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op >> >> # After, compiler blackholes >> ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better >> ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better >> ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better >> ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better >> >> >> `-prof perfasm` shows the reason for these improvements clearly: >> >> Before: >> >> >> ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 >> 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d >> 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f >> ? 0x00007f0b5449836a: shl $0x3,%r10 >> 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" >> 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" >> 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 >> 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ >> 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 >> 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check >> 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d >> 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 >> >> >> After: >> >> >> >> ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read >> 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx >> 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx >> 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx >> 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx >> 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx >> 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx >> 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 >> ? 0x00007fa06c49a8dc: cmp %esi,%r10d >> 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 >> >> >> Additional testing: >> - [x] Eyeballing JMH Samples `-prof perfasm` >> - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` >> - [x] Linux x86_64 fastdebug, JDK benchmark corpus > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: > > - Blackhole as CFG node > - Merge branch 'master' into JDK-8296545-blackhole-effects > - Blackhole should be AliasIdxTop > - Do not touch memory at all > - Fix Thanks for reviews. I'll sit on it a bit, while doing more extensive benchmarking with these new blackholes. ------------- PR: https://git.openjdk.org/jdk/pull/11041 From shade at openjdk.org Tue Nov 15 13:43:08 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Nov 2022 13:43:08 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v3] In-Reply-To: <8oFMPgU_6YZxluUlWdVW2EApTVylfVuk2hyzFXPgBY0=.3919a854-56d6-4183-87e8-264d7c93d885@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> <8oFMPgU_6YZxluUlWdVW2EApTVylfVuk2hyzFXPgBY0=.3919a854-56d6-4183-87e8-264d7c93d885@github.com> Message-ID: <_L9vCHele-RJbFEMqgUySkFAnRe9DitExNAapQnEVeM=.2c3aa2d5-c6b7-420b-a59d-d02d7301dca9@github.com> On Mon, 14 Nov 2022 21:14:38 GMT, Vladimir Kozlov wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - Blackhole as CFG node >> - Merge branch 'master' into JDK-8296545-blackhole-effects >> - Blackhole should be AliasIdxTop >> - Do not touch memory at all >> - Fix > > src/hotspot/share/opto/cfgnode.hpp line 610: > >> 608: // Blackhole all arguments. This node would survive through the compiler >> 609: // the effects on its arguments, and would be finally matched to nothing. >> 610: class BlackholeNode : public MultiNode { > > Also update comment at [cfgnode.hpp#L43](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.hpp#L43) Done so! ------------- PR: https://git.openjdk.org/jdk/pull/11041 From tholenstein at openjdk.org Tue Nov 15 13:50:03 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 15 Nov 2022 13:50:03 GMT Subject: RFR: JDK-8297032: IGV: shortcut to center selected nodes Message-ID: Introduce a new shortcut `CTRL-9`/ `CMD-9` to center the nodes that are currently selected in IGV ![center_selected_nodes](https://user-images.githubusercontent.com/71546117/201934216-0b65caa2-af62-4083-877b-e5747d5409ee.png) ------------- Commit messages: - JDK-8297032: IGV: shortcut to center selected nodes Changes: https://git.openjdk.org/jdk/pull/11167/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11167&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297032 Stats: 283 lines in 3 files changed: 281 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11167.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11167/head:pull/11167 PR: https://git.openjdk.org/jdk/pull/11167 From duke at openjdk.org Tue Nov 15 13:53:57 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Tue, 15 Nov 2022 13:53:57 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 07:57:30 GMT, Yi-Fan Tsai wrote: > The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. 
> > Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. > > No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. > > Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. Performance without the optimization on Ice Lake: Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units MessageDigests.digest md5 64 DEFAULT thrpt 15 5402.018 ? 17.033 ops/ms MessageDigests.digest md5 16384 DEFAULT thrpt 15 43.722 ? 0.003 ops/ms MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 4652.620 ? 35.432 ops/ms MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 43.573 ? 0.016 ops/ms Performance with optimization on Ice Lake: Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units MessageDigests.digest md5 64 DEFAULT thrpt 15 5348.594 ? 14.303 ops/ms MessageDigests.digest md5 16384 DEFAULT thrpt 15 43.671 ? 0.008 ops/ms MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 4583.530 ? 12.752 ops/ms MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 43.545 ? 0.006 ops/ms ------------- PR: https://git.openjdk.org/jdk/pull/11054 From duke at openjdk.org Tue Nov 15 13:56:06 2022 From: duke at openjdk.org (zzambers) Date: Tue, 15 Nov 2022 13:56:06 GMT Subject: RFR: 8295952: Problemlist existing compiler/rtm tests also on x86 In-Reply-To: References: Message-ID: <4pDysZ5rfejMZTjsuofGKhBiZ6GftT5CwyeW2eWWj14=.8bf56418-43b9-4ebe-af4f-e9bb3e73614b@github.com> On Tue, 15 Nov 2022 10:16:22 GMT, Christian Hagedorn wrote: >> Problemlist should be extended so that existing compiler/rtm entries include x86 (32-bit) intel builds as well, as these are also affected. > > Looks good and trivial! @chhagedorn Thanks for review ------------- PR: https://git.openjdk.org/jdk/pull/10875 From chagedorn at openjdk.org Tue Nov 15 14:45:04 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 15 Nov 2022 14:45:04 GMT Subject: RFR: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong [v3] In-Reply-To: References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: On Fri, 11 Nov 2022 17:33:02 GMT, Christian Hagedorn wrote: >> We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): >> >> assert(real_LCA != NULL, "must always find an LCA" >> ``` >> The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: >> - I limited the node dump to idx + node name to reduce the noise which made it hard to read. >> - Reversed the idom chain dumps to reflect the graph structure. >> >> Example output: >> >> Bad graph detected in build_loop_late >> n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) >> >> [... same output as before ...] 
>> >> idoms of early "197 IfFalse": >> idom[2]: 42 If >> idom[1]: 44 IfTrue >> idom[0]: 196 If >> n: 197 IfFalse >> >> idoms of (wrong) LCA "205 IfTrue": >> idom[4]: 42 If >> idom[3]: 37 Region >> idom[2]: 73 If >> idom[1]: 83 IfTrue >> idom[0]: 204 If >> n: 205 IfTrue >> >> Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): >> 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) >> >> Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Change algorithm as suggested by Roberto Thank you Roberto and Vladimir for your reviews! ------------- PR: https://git.openjdk.org/jdk/pull/11015 From chagedorn at openjdk.org Tue Nov 15 14:47:07 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 15 Nov 2022 14:47:07 GMT Subject: Integrated: 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong In-Reply-To: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> References: <2pkCki3_MbAhroP1p8173jY1ErpHaEEvGlRxH7nGkvM=.3675ca65-171e-4634-b0ed-b61c7c30ebf3@github.com> Message-ID: On Mon, 7 Nov 2022 11:21:36 GMT, Christian Hagedorn wrote: > We sometimes hit the following assert when dumping a bad graph (before crashing with the bad graph assertion): > > assert(real_LCA != NULL, "must always find an LCA" > ``` > The algorithm is not correct as we should always find an LCA of two nodes. To fix this, I've re-implemented the algorithm and improved the dumped idom chains: > - I limited the node dump to idx + node name to reduce the noise which made it hard to read. > - Reversed the idom chain dumps to reflect the graph structure. > > Example output: > > Bad graph detected in build_loop_late > n: 138 CastPP === 205 38 [[ 263 140 140 168 ]] #Test:NotNull * Oop:Test:NotNull * !jvms: Test::mainTest @ bci:40 (line 154) > > [... same output as before ...] > > idoms of early "197 IfFalse": > idom[2]: 42 If > idom[1]: 44 IfTrue > idom[0]: 196 If > n: 197 IfFalse > > idoms of (wrong) LCA "205 IfTrue": > idom[4]: 42 If > idom[3]: 37 Region > idom[2]: 73 If > idom[1]: 83 IfTrue > idom[0]: 204 If > n: 205 IfTrue > > Real LCA of early "197 IfFalse" (idom[2]) and wrong LCA "205 IfTrue" (idom[4]): > 42 If === 30 41 [[ 43 44 ]] P=0.999000, C=-1.000000 !jvms: Test::mainTest @ bci:32 (line 153) > > Tested by manually calling `dump_idoms` during a compilation and by running reproducers of different bad graph assertion bugs. > > Thanks, > Christian This pull request has now been integrated. Changeset: decb1b79 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/decb1b79bc475f024a02135fa3394ff97098e758 Stats: 145 lines in 2 files changed: 70 ins; 40 del; 35 mod 8286800: Assert in PhaseIdealLoop::dump_real_LCA is too strong Reviewed-by: kvn, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/11015 From tholenstein at openjdk.org Tue Nov 15 14:51:48 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 15 Nov 2022 14:51:48 GMT Subject: RFR: JDK-8297047: IGV: graphContent not set when opening a new tab Message-ID: Open any graph in IGV. The graph will be opened in a new tab as expected. But the tab has the name "graph" instead of the actual graph name. Further, the "Bytecode" and "Control Flow" windows are not updated with the current graph. 
The reason was that `graphContent` was not set when opening a new EditorTopComponent. Before: ![graph_not_updated](https://user-images.githubusercontent.com/71546117/201946772-727f1c57-d69e-4551-a560-14d18cfb2b63.png) Now the title of tab and the "Control Flow" is updated: ![graph_updated](https://user-images.githubusercontent.com/71546117/201947659-a238d0a2-b064-4373-81dc-7fb3f0dea7ec.png) ------------- Commit messages: - JDK-8297047: IGV: graphContent not set when opening a new tab Changes: https://git.openjdk.org/jdk/pull/11168/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11168&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297047 Stats: 13 lines in 1 file changed: 8 ins; 4 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11168.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11168/head:pull/11168 PR: https://git.openjdk.org/jdk/pull/11168 From duke at openjdk.org Tue Nov 15 15:13:52 2022 From: duke at openjdk.org (Joshua Cao) Date: Tue, 15 Nov 2022 15:13:52 GMT Subject: RFR: JDK-8296969: C1: PrintC1Statistics is broken after JDK-8292878 Message-ID: Issue is coming from https://bugs.openjdk.org/browse/JDK-8292878. This PR adds the `rscratch1` argument to the affected `incrementl` call. ------------- Commit messages: - 8296969: -XX:+PrintC1Statistics is broken Changes: https://git.openjdk.org/jdk/pull/11170/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11170&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296969 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11170.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11170/head:pull/11170 PR: https://git.openjdk.org/jdk/pull/11170 From chagedorn at openjdk.org Tue Nov 15 16:07:00 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 15 Nov 2022 16:07:00 GMT Subject: RFR: JDK-8296969: C1: PrintC1Statistics is broken after JDK-8292878 In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 15:06:05 GMT, Joshua Cao wrote: > Issue is coming from https://bugs.openjdk.org/browse/JDK-8292878. This PR adds the `rscratch1` argument to the affected `incrementl` call. Looks good! I suggest to additionally add a hello world test with that flag to do some sanity testing. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11170 From tholenstein at openjdk.org Tue Nov 15 16:18:47 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 15 Nov 2022 16:18:47 GMT Subject: RFR: JDK-8297007: IGV: Link/Unlink node selection of open tabs Message-ID: In IGV graphs can be opened in several tabs and then display them side-by-side. Previously, when the user selected nodes in tab A the selection was also applied in tab B. We now introduce a new global button to link and unlink the selection of different tabs. ![link_button](https://user-images.githubusercontent.com/71546117/201961318-e1263f6c-b3e9-41d5-a1f1-5493d5294bb5.png) If the button is **pressed**, the selection is **linked** globally across tabs: ![linked](https://user-images.githubusercontent.com/71546117/201960953-88f90c74-1c87-4c29-9881-47b55e7c26b9.png) If the button is **not pressed**, the selection is **not linked** across tabs. This is the default setting: ![unlinked](https://user-images.githubusercontent.com/71546117/201961012-f531e7b9-1f23-4584-b207-02529ae25d5a.png) # Implementation The `SelectionCoordinator` is responsible to update the other tabs when the selection changes. 
We simply disable the `SelectionCoordinator` when the link button is not pressed, and enable it otherwise.

-------------

Commit messages:
 - GlobalSelectionAction

Changes: https://git.openjdk.org/jdk/pull/11171/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11171&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8297007
  Stats: 93 lines in 5 files changed: 74 ins; 5 del; 14 mod
  Patch: https://git.openjdk.org/jdk/pull/11171.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11171/head:pull/11171

PR: https://git.openjdk.org/jdk/pull/11171

From eastigeevich at openjdk.org Tue Nov 15 16:34:58 2022
From: eastigeevich at openjdk.org (Evgeny Astigeevich)
Date: Tue, 15 Nov 2022 16:34:58 GMT
Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64
In-Reply-To: 
References: 
Message-ID: <2O5n6UzCnpjCM-BKPnVxlwrnTmtF73m2A-5K6xdANfI=.25c48d96-8a9f-4f7f-9153-0367995b5dff@github.com>

On Mon, 14 Nov 2022 15:47:25 GMT, Ludovic Henry wrote:

>> The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput.
>>
>> This change replaces
>> LEA: r1 = r1 + rsi * 1 + t
>> with
>> ADDs: r1 += t; r1 += rsi.
>>
>> Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc.
>>
>> No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc.
>>
>> Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake.
>
> Could you please post JMH microbenchmarks with and without this change? You can run them with `org.openjdk.bench.java.security.MessageDigests` [1]
>
> [1] https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java

@luhenry, @vnkozlov Sorry for the uninformative PR description.

In the MD5 intrinsic stub we use a 3-operand LEA, and this LEA is on the critical path. The optimization is done according to the Intel 64 and IA-32 Architectures Optimization Reference Manual (Feb 2022), section 3.5.1.2:

"In Sandy Bridge microarchitecture, there are two significant changes to the performance characteristics of LEA instruction: For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
- LEA that has all three source operands: base, index, and offset.
- LEA that uses base and index registers where the base is EBP, RBP, or R13.
- LEA that uses RIP relative addressing mode.
- LEA that uses 16-bit addressing mode.

Assembly/Compiler Coding Rule 30. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better."

ADD has had latency 1 and throughput 4 since Haswell (see https://www.agner.org/optimize/instruction_tables.pdf). From https://www.agner.org/optimize/instruction_tables.pdf, in Ice Lake LEA performance was improved to latency 1 and throughput 2. This explains why there is no improvement on Ice Lake. The patch correctness was tested with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics.
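A stand-alone sanity check of the same code path needs nothing beyond the public `MessageDigest` API. The following is only an illustrative sketch (it is not one of the jtreg tests named above): it verifies the RFC 1321 test vector for "abc", and it can be run once with the intrinsic enabled and once with it disabled (for example `-XX:+UnlockDiagnosticVMOptions -XX:-UseMD5Intrinsics`; the diagnostic status of the `UseMD5Intrinsics` flag is an assumption here) to confirm that both paths agree.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Known-answer check: MD5("abc") must be 900150983cd24fb0d6963f7d28e17f72 (RFC 1321 test suite).
public class MD5KnownAnswer {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest("abc".getBytes(StandardCharsets.US_ASCII));
        String hex = HexFormat.of().formatHex(digest);
        if (!"900150983cd24fb0d6963f7d28e17f72".equals(hex)) {
            throw new AssertionError("Unexpected MD5 digest: " + hex);
        }
        System.out.println("MD5(\"abc\") = " + hex);
    }
}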
The microbenchmark we used:

import org.apache.commons.lang3.RandomStringUtils;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.BenchmarkParams;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MD5Benchmark {
    private static final int MAX_INPUTS_COUNT = 1000;
    private static final int MAX_INPUT_LENGTH = 128 * 1024;
    // Random inputs shared by all benchmark threads.
    private static List<byte[]> inputs;
    static {
        inputs = new ArrayList<>();
        IntStream.rangeClosed(1, MAX_INPUTS_COUNT).forEach(value ->
            inputs.add(RandomStringUtils.randomAlphabetic(MAX_INPUT_LENGTH).getBytes(StandardCharsets.UTF_8)));
    }

    @Param({"64", "128", "256", "512", "1024", "2048", "4096", "8192", "16384", "32768", "65536", "131072"})
    private int data_len;

    @State(Scope.Thread)
    public static class InputData {
        byte[] data;
        int count;
        byte[] expectedDigest;
        byte[] digest;

        @Setup
        public void setup(BenchmarkParams params) {
            data = inputs.get(ThreadLocalRandom.current().nextInt(0, MAX_INPUTS_COUNT));
            count = Integer.parseInt(params.getParam("data_len"));
            // Reference digest used to verify the benchmark result (helper not shown here).
            expectedDigest = calculateJdkMD5Checksum(data, count);
        }

        @TearDown
        public void check() {
            if (!Arrays.equals(expectedDigest, digest)) {
                throw new RuntimeException("Expected md5 digest:\n" + Arrays.toString(expectedDigest) +
                                           "\nGot:\n" + Arrays.toString(digest));
            }
        }
    }

    @Benchmark
    public void testMD5(InputData in) {
        in.digest = calculateMD5Checksum(in.data, in.count);
    }

    private static byte[] calculateMD5Checksum(byte[] input, int count) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(input, 0, count);
            return md5.digest();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

-------------

PR: https://git.openjdk.org/jdk/pull/11054

From chagedorn at openjdk.org Tue Nov 15 16:50:00 2022
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Tue, 15 Nov 2022 16:50:00 GMT
Subject: RFR: JDK-8297047: IGV: graphContent not set when opening a new tab
In-Reply-To: 
References: 
Message-ID: 

On Tue, 15 Nov 2022 14:38:57 GMT, Tobias Holenstein wrote:

> Open any graph in IGV. The graph will be opened in a new tab as expected. But the tab has the name "graph" instead of the actual graph name. Further, the "Bytecode" and "Control Flow" windows are not updated with the current graph.
>
> The reason was that `graphContent` was not set when opening a new EditorTopComponent.
>
> Before:
> ![graph_not_updated](https://user-images.githubusercontent.com/71546117/201946772-727f1c57-d69e-4551-a560-14d18cfb2b63.png)
>
> Now the title of tab and the "Control Flow" is updated:
> ![graph_updated](https://user-images.githubusercontent.com/71546117/201947659-a238d0a2-b064-4373-81dc-7fb3f0dea7ec.png)

Looks good!

-------------

Marked as reviewed by chagedorn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11168 From chagedorn at openjdk.org Tue Nov 15 16:52:03 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 15 Nov 2022 16:52:03 GMT Subject: RFR: JDK-8297032: IGV: shortcut to center selected nodes In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 13:39:45 GMT, Tobias Holenstein wrote: > Introduce a new shortcut `CTRL-9`/ `CMD-9` to center the nodes that are currently selected in IGV > > ![center_selected_nodes](https://user-images.githubusercontent.com/71546117/201934216-0b65caa2-af62-4083-877b-e5747d5409ee.png) Works as expected on Linux - looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11167 From duke at openjdk.org Tue Nov 15 17:16:50 2022 From: duke at openjdk.org (Joshua Cao) Date: Tue, 15 Nov 2022 17:16:50 GMT Subject: RFR: JDK-8296969: C1: PrintC1Statistics is broken after JDK-8292878 [v2] In-Reply-To: References: Message-ID: > Issue is coming from https://bugs.openjdk.org/browse/JDK-8292878. This PR adds the `rscratch1` argument to the affected `incrementl` call. Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: PrintC1Statistics test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11170/files - new: https://git.openjdk.org/jdk/pull/11170/files/7a2c075b..da9b98c5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11170&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11170&range=00-01 Stats: 51 lines in 1 file changed: 51 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11170.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11170/head:pull/11170 PR: https://git.openjdk.org/jdk/pull/11170 From duke at openjdk.org Tue Nov 15 17:16:51 2022 From: duke at openjdk.org (Joshua Cao) Date: Tue, 15 Nov 2022 17:16:51 GMT Subject: RFR: JDK-8296969: C1: PrintC1Statistics is broken after JDK-8292878 [v2] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 16:04:35 GMT, Christian Hagedorn wrote: > Looks good! I suggest to additionally add a hello world test with that flag to do some sanity testing. Thanks. I added a basic test. ------------- PR: https://git.openjdk.org/jdk/pull/11170 From kvn at openjdk.org Tue Nov 15 17:24:20 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 15 Nov 2022 17:24:20 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 07:57:30 GMT, Yi-Fan Tsai wrote: > The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. > > Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. > > No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. > > Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. Thank you all for providing performance data. Looks good. I will run testing. Do we have other intrinsics which use LEA (not for this fix)? ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11054 From kvn at openjdk.org Tue Nov 15 17:24:20 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 15 Nov 2022 17:24:20 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 13:51:24 GMT, Yi-Fan Tsai wrote: >> The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. >> >> This change replaces >> LEA: r1 = r1 + rsi * 1 + t >> with >> ADDs: r1 += t; r1 += rsi. >> >> Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. >> >> No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. >> >> Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. > > Performance without the optimization on Ice Lake: > > Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units > MessageDigests.digest md5 64 DEFAULT thrpt 15 5402.018 ? 17.033 ops/ms > MessageDigests.digest md5 16384 DEFAULT thrpt 15 43.722 ? 0.003 ops/ms > MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 4652.620 ? 35.432 ops/ms > MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 43.573 ? 0.016 ops/ms > > > Performance with optimization on Ice Lake: > > Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units > MessageDigests.digest md5 64 DEFAULT thrpt 15 5348.594 ? 14.303 ops/ms > MessageDigests.digest md5 16384 DEFAULT thrpt 15 43.671 ? 0.008 ops/ms > MessageDigests.getAndDigest md5 64 DEFAULT thrpt 15 4583.530 ? 12.752 ops/ms > MessageDigests.getAndDigest md5 16384 DEFAULT thrpt 15 43.545 ? 0.006 ops/ms @yftsai can you merge latest JDK sources? Some of GHA testing failures should be fixed. ------------- PR: https://git.openjdk.org/jdk/pull/11054 From jnimeh at openjdk.org Tue Nov 15 17:37:58 2022 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Tue, 15 Nov 2022 17:37:58 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 17:18:37 GMT, Vladimir Kozlov wrote: > Do we have other intrinsics which use LEA (not for this fix)? My pending ChaCha20 intrinsics ( #7702 ) use LEA for getting the address of constant data to be loaded into SIMD registers. That happens before the 10-iteration loop that implements the 20 rounds (which is the critical section of the intrinsic). ------------- PR: https://git.openjdk.org/jdk/pull/11054 From duke at openjdk.org Tue Nov 15 17:44:12 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 15 Nov 2022 17:44:12 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> Message-ID: On Tue, 15 Nov 2022 00:06:40 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 23 commits: >> >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Vladimir's review >> - live review with Sandhya >> - jcheck >> - Sandhya's review >> - fix windows and 32b linux builds >> - add getLimbs to interface and reviews >> - fix 32-bit build >> - make UsePolyIntrinsics option diagnostic >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 384: > >> 382: void StubGenerator::poly1305_limbs(const Register limbs, const Register a0, const Register a1, const Register a2, bool only128) >> 383: { >> 384: const Register t1 = r13; > > Please, make the temps explicit and lift them into arguments. Otherwise, it's hard to see what registers are clobbered when helper methods are called. Thanks for pointing this out.. I spent quite a bit of time and went back and forth on 'register allocation'... it does make sense to pass all the temps needed, when the number of temps is small. This is the case for the three `*_limbs_*` functions. Maybe I should indeed do that... On other hand, there are functions like `poly1305_multiply8_avx512` and `poly1305_process_blocks_avx512` that use a _lot_ of temp registers. I think it makes sense to keep those as 'function-header declarations'. Then there are functions like `poly1305_multiply_scalar` that could go either way, has some temps and 'implicitly clobbered' registers, but probably should stay 'as is'.. I ended up being 'pedantic' and making _all_ temps into 'header variables'. I also tried to comment, but those probably mean more to me then anyone else in hindsight? // Register Map: // GPRs: // input = rdi // length = rbx // accumulator = rcx // R = r8 // a0 = rsi // a1 = r9 // a2 = r10 // r0 = r11 // r1 = r12 // c1 = r8; // t1 = r13 // t2 = r14 // t3 = r15 // t0 = r14 // rscratch = r13 // stack(rsp, rbp) // imul(rax, rdx) // ZMMs: // T: xmm0-6 // C: xmm7-9 // A: xmm13-18 // B: xmm19-24 // R: xmm25-29 ... // Register Map: // reserved: rsp, rbp, rcx // PARAMs: rdi, rbx, rsi, r8-r12 // poly1305_multiply_scalar clobbers: r13-r15, rax, rdx const Register t0 = r14; const Register t1 = r13; const Register rscratch = r13; // poly1305_limbs_avx512 clobbers: xmm0, xmm1 // poly1305_multiply8_avx512 clobbers: xmm0-xmm6 const XMMRegister T0 = xmm2; ... I think I am ok changing the `*limbs*` functions (even started, before I remembered my train of thought from months back..) but let me know if you agree with the rest of the reasoning? ------------- PR: https://git.openjdk.org/jdk/pull/10582 From eastigeevich at openjdk.org Tue Nov 15 17:50:59 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 15 Nov 2022 17:50:59 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 17:33:50 GMT, Jamil Nimeh wrote: > Do we have other intrinsics which use LEA (not for this fix)? I have plans to look at other uses of LEA in Hotspot. I have not started yet due to other urgent work. ------------- PR: https://git.openjdk.org/jdk/pull/11054 From eastigeevich at openjdk.org Tue Nov 15 18:00:04 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 15 Nov 2022 18:00:04 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 17:33:50 GMT, Jamil Nimeh wrote: > > Do we have other intrinsics which use LEA (not for this fix)? 
> > My pending ChaCha20 intrinsics ( #7702 ) use LEA for getting the address of constant data to be loaded into SIMD registers. That happens before the 10-iteration loop that implements the 20 rounds (which is the critical section of the intrinsic). >From #7702, I see they are not 3 operand LEA. No need to change them. ------------- PR: https://git.openjdk.org/jdk/pull/11054 From rcastanedalo at openjdk.org Tue Nov 15 18:30:13 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 15 Nov 2022 18:30:13 GMT Subject: RFR: JDK-8297032: IGV: shortcut to center selected nodes In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 13:39:45 GMT, Tobias Holenstein wrote: > Introduce a new shortcut `CTRL-9`/ `CMD-9` to center the nodes that are currently selected in IGV > > ![center_selected_nodes](https://user-images.githubusercontent.com/71546117/201934216-0b65caa2-af62-4083-877b-e5747d5409ee.png) Looks good to me! A nit (up to you whether you want to address it in this PR): you might disable the action when no node is selected, similarly to the node extraction action. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11167 From kvn at openjdk.org Tue Nov 15 18:32:01 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 15 Nov 2022 18:32:01 GMT Subject: RFR: 8295952: Problemlist existing compiler/rtm tests also on x86 In-Reply-To: References: Message-ID: On Wed, 26 Oct 2022 16:43:26 GMT, zzambers wrote: > Problemlist should be extended so that existing compiler/rtm entries include x86 (32-bit) intel builds as well, as these are also affected. Marked as reviewed by kvn (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10875 From kvn at openjdk.org Tue Nov 15 18:36:56 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 15 Nov 2022 18:36:56 GMT Subject: RFR: JDK-8296969: C1: PrintC1Statistics is broken after JDK-8292878 [v2] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 17:16:50 GMT, Joshua Cao wrote: >> Issue is coming from https://bugs.openjdk.org/browse/JDK-8292878. This PR adds the `rscratch1` argument to the affected `incrementl` call. > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > PrintC1Statistics test Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11170 From rcastanedalo at openjdk.org Tue Nov 15 18:43:13 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 15 Nov 2022 18:43:13 GMT Subject: RFR: JDK-8297047: IGV: graphContent not set when opening a new tab In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 14:38:57 GMT, Tobias Holenstein wrote: > Open any graph in IGV. The graph will be opened in a new tab as expected. But the tab has the name "graph" instead of the actual graph name. Further, the "Bytecode" and "Control Flow" windows are not updated with the current graph. > > The reason was that `graphContent` was not set when opening a new EditorTopComponent. > > Before: > ![graph_not_updated](https://user-images.githubusercontent.com/71546117/201946772-727f1c57-d69e-4551-a560-14d18cfb2b63.png) > > Now the title of tab and the "Control Flow" is updated: > ![graph_updated](https://user-images.githubusercontent.com/71546117/201947659-a238d0a2-b064-4373-81dc-7fb3f0dea7ec.png) Looks good. ------------- Marked as reviewed by rcastanedalo (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11168 From sviswanathan at openjdk.org Tue Nov 15 18:52:08 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 15 Nov 2022 18:52:08 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 17:57:35 GMT, Evgeny Astigeevich wrote: > Do we have other intrinsics which use LEA (not for this fix)? There is a VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() check available which is used to do lea optimizations. ------------- PR: https://git.openjdk.org/jdk/pull/11054 From rcastanedalo at openjdk.org Tue Nov 15 18:56:31 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 15 Nov 2022 18:56:31 GMT Subject: RFR: JDK-8297007: IGV: Link/Unlink node selection of open tabs In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 15:37:26 GMT, Tobias Holenstein wrote: > In IGV graphs can be opened in several tabs and then display them side-by-side. Previously, when the user selected nodes in tab A the selection was also applied in tab B. > > We now introduce a new global button to link and unlink the selection of different tabs. > ![link_button](https://user-images.githubusercontent.com/71546117/201961318-e1263f6c-b3e9-41d5-a1f1-5493d5294bb5.png) > > If the button is **pressed**, the selection is **linked** globally across tabs: > ![linked](https://user-images.githubusercontent.com/71546117/201960953-88f90c74-1c87-4c29-9881-47b55e7c26b9.png) > > If the button is **not pressed**, the selection is **not linked** across tabs. This is the default setting: > ![unlinked](https://user-images.githubusercontent.com/71546117/201961012-f531e7b9-1f23-4584-b207-02529ae25d5a.png) > > # Implementation > The `SelectionCoordinator` is responsible to update the other tabs when the selection changes. We simply disable the `SelectionCoordinator` when the link button is not pressed, and enable it otherwise. Great improvement, this will really enable side-by-side viewing/exploration. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/GlobalSelectionAction.java line 2: > 1: /* > 2: * Copyright (c) 2011, 2015, Oracle and/or its affiliates. All rights reserved. Set new copyright year. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/GlobalSelectionAction.java line 32: > 30: import javax.swing.ImageIcon; > 31: import org.openide.util.ImageUtilities; > 32: Would be nice to assign a shortcut to this action. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11171 From xliu at openjdk.org Tue Nov 15 19:07:57 2022 From: xliu at openjdk.org (Xin Liu) Date: Tue, 15 Nov 2022 19:07:57 GMT Subject: RFR: JDK-8296969: C1: PrintC1Statistics is broken after JDK-8292878 [v2] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 17:16:50 GMT, Joshua Cao wrote: >> Issue is coming from https://bugs.openjdk.org/browse/JDK-8292878. This PR adds the `rscratch1` argument to the affected `incrementl` call. > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > PrintC1Statistics test LGTM. I am not a reviewer. ------------- Marked as reviewed by xliu (Committer). 
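The sanity test discussed in the PrintC1Statistics thread above amounts to running any trivial program with the flag and checking that the VM compiles with C1 and exits cleanly. A minimal stand-alone sketch (not the actual jtreg test added in the PR) could look like the following; it assumes a debug build, where the flag is available, and forces C1 with -XX:TieredStopAtLevel=1:

// Run with, for example:
//   java -XX:+PrintC1Statistics -XX:TieredStopAtLevel=1 HelloC1Stats
// The statistics are printed when the VM shuts down.
public class HelloC1Stats {
    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) { // hot enough for C1 to compile (OSR) this loop
            sum += i;
        }
        System.out.println("sum = " + sum);
    }
}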
PR: https://git.openjdk.org/jdk/pull/11170 From kvn at openjdk.org Tue Nov 15 19:21:02 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 15 Nov 2022 19:21:02 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 18:48:14 GMT, Sandhya Viswanathan wrote: > > Do we have other intrinsics which use LEA (not for this fix)? > > There is a VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() check available which is used to do lea optimizations. Thanks you @sviswa7 For this fix, based on IceLake data provided by @yftsai, `supports_fast_3op_lea()` potential help is not enough to justify increase complexity of code. May be in other places it would be more useful but not here IMHO. ------------- PR: https://git.openjdk.org/jdk/pull/11054 From sviswanathan at openjdk.org Tue Nov 15 19:33:00 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 15 Nov 2022 19:33:00 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 07:57:30 GMT, Yi-Fan Tsai wrote: > The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. > > Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. > > No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. > > Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. Marked as reviewed by sviswanathan (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11054 From sviswanathan at openjdk.org Tue Nov 15 19:33:05 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 15 Nov 2022 19:33:05 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: <3hlcUrGv9rVz2BWY2uKlObuyAXSD2BjDViGR3v57z-Q=.f1966c4b-cf39-48ac-b627-6a7001c65acc@github.com> On Tue, 15 Nov 2022 19:19:01 GMT, Vladimir Kozlov wrote: > > > Do we have other intrinsics which use LEA (not for this fix)? > > > > > > There is a VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() check available which is used to do lea optimizations. > > Thanks you @sviswa7 > > For this fix, based on IceLake data provided by @yftsai, `supports_fast_3op_lea()` potential help is not enough to justify increase complexity of code. May be in other places it would be more useful but not here IMHO. Yes, I agree. The PR looks good to me. 
------------- PR: https://git.openjdk.org/jdk/pull/11054 From duke at openjdk.org Tue Nov 15 19:43:17 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 15 Nov 2022 19:43:17 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> Message-ID: <5Jz1ZjH_bvH1Imw-Dptwrg9vZFA9lP8PNxnUWjnCru8=.18040fe2-6520-425a-8836-fe382a1e2f34@github.com> On Tue, 15 Nov 2022 00:16:19 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: >> >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Vladimir's review >> - live review with Sandhya >> - jcheck >> - Sandhya's review >> - fix windows and 32b linux builds >> - add getLimbs to interface and reviews >> - fix 32-bit build >> - make UsePolyIntrinsics option diagnostic >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 103: > >> 101: >> 102: ATTRIBUTE_ALIGNED(64) uint64_t POLY1305_MASK44[] = { >> 103: // OFFSET 64: mask_44 > > Redundant comment. done > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 987: > >> 985: >> 986: // Load R into r1:r0 >> 987: poly1305_limbs(R, r0, r1, r1, true); > > What's the intention here when you pass `r1` twice? Just load `R[0]` and `R[2]`. You could use `noreg` to mark an optional operation and check for it in `poly1305_limbs` before loading the corresponding element. ah, I was wondering how to make an 'optional reg' when parameter is not a pointer. `noreg` is exactly what I needed, thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 15 19:43:18 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 15 Nov 2022 19:43:18 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> <-JVYIHKOY_LuVTqyH5xuubtPdk8pK_wi5z-8pestRis=.e63938ab-0ac2-4880-8238-e6e6d8debf03@github.com> Message-ID: <9p2RTAI9FPWstQu0OtpSmSB7dqhFwmxbw86zZQg4GtU=.1be10660-ee5d-4654-9d4e-4fe3e449fd9b@github.com> On Tue, 15 Nov 2022 00:45:54 GMT, Vladimir Ivanov wrote: >> library_call.cpp takes care of that, it passes the address of 0'th element to the stub. > > Ah, got it. Worth elaborating that in the comments. Otherwise, they confuse rather than help: > > // void processBlocks(byte[] input, int len, int[5] a, int[5] r) > const Register input = rdi; //input+offset > const Register length = rbx; > const Register accumulator = rcx; > const Register R = r8; Added a comment, hopefully less confusing. 
------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 15 19:43:18 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 15 Nov 2022 19:43:18 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> Message-ID: On Tue, 15 Nov 2022 17:42:08 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 384: >> >>> 382: void StubGenerator::poly1305_limbs(const Register limbs, const Register a0, const Register a1, const Register a2, bool only128) >>> 383: { >>> 384: const Register t1 = r13; >> >> Please, make the temps explicit and lift them into arguments. Otherwise, it's hard to see what registers are clobbered when helper methods are called. > > Thanks for pointing this out.. I spent quite a bit of time and went back and forth on 'register allocation'... it does make sense to pass all the temps needed, when the number of temps is small. This is the case for the three `*_limbs_*` functions. Maybe I should indeed do that... > > On other hand, there are functions like `poly1305_multiply8_avx512` and `poly1305_process_blocks_avx512` that use a _lot_ of temp registers. I think it makes sense to keep those as 'function-header declarations'. > > Then there are functions like `poly1305_multiply_scalar` that could go either way, has some temps and 'implicitly clobbered' registers, but probably should stay 'as is'.. > > I ended up being 'pedantic' and making _all_ temps into 'header variables'. I also tried to comment, but those probably mean more to me then anyone else in hindsight? > > > // Register Map: > // GPRs: > // input = rdi > // length = rbx > // accumulator = rcx > // R = r8 > // a0 = rsi > // a1 = r9 > // a2 = r10 > // r0 = r11 > // r1 = r12 > // c1 = r8; > // t1 = r13 > // t2 = r14 > // t3 = r15 > // t0 = r14 > // rscratch = r13 > // stack(rsp, rbp) > // imul(rax, rdx) > // ZMMs: > // T: xmm0-6 > // C: xmm7-9 > // A: xmm13-18 > // B: xmm19-24 > // R: xmm25-29 > ... > // Register Map: > // reserved: rsp, rbp, rcx > // PARAMs: rdi, rbx, rsi, r8-r12 > // poly1305_multiply_scalar clobbers: r13-r15, rax, rdx > const Register t0 = r14; > const Register t1 = r13; > const Register rscratch = r13; > > // poly1305_limbs_avx512 clobbers: xmm0, xmm1 > // poly1305_multiply8_avx512 clobbers: xmm0-xmm6 > const XMMRegister T0 = xmm2; > ... > > > I think I am ok changing the `*limbs*` functions (even started, before I remembered my train of thought from months back..) but let me know if you agree with the rest of the reasoning? Changed just the three `*limbs*` functions. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 15 19:46:48 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 15 Nov 2022 19:46:48 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v18] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. 
> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: extra whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/8f5942d9..58488f42 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=16-17 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 15 19:46:52 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 15 Nov 2022 19:46:52 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 00:43:16 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: >> >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Vladimir's review >> - live review with Sandhya >> - jcheck >> - Sandhya's review >> - fix windows and 32b linux builds >> - add getLimbs to interface and reviews >> - fix 32-bit build >> - make UsePolyIntrinsics option diagnostic >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db > > src/hotspot/share/opto/library_call.cpp line 6976: > >> 6974: >> 6975: if (!stubAddr) return false; >> 6976: Node* input = argument(1); > > Receiver null check is missing. Since the method being intrinsified is non-static, the intrinsic itself has to take care of receiver null check. I think I found the right code to copy-paste, if you could check again pls. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 15 19:43:11 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 15 Nov 2022 19:43:11 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v17] In-Reply-To: References: Message-ID: <7hXP-vwxc6J7fklu8QuJqiIcSQRff-QyR1SZ0Fzfqmc=.33a38a51-38c3-451a-a756-ed538507f04e@github.com> > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. 
> > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 25 commits: - Vladimir's review comments - Merge remote-tracking branch 'origin/master' into avx512-poly - Merge remote-tracking branch 'origin/master' into avx512-poly - Vladimir's review - live review with Sandhya - jcheck - Sandhya's review - fix windows and 32b linux builds - add getLimbs to interface and reviews - fix 32-bit build - ... and 15 more: https://git.openjdk.org/jdk/compare/7357a1a3...8f5942d9 ------------- Changes: https://git.openjdk.org/jdk/pull/10582/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=16 Stats: 1859 lines in 32 files changed: 1823 ins; 3 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 15 20:09:41 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 15 Nov 2022 20:09:41 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v19] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 
1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: use noreg properly in poly1305_limbs ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/58488f42..cbf49380 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=18 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=17-18 Stats: 7 lines in 2 files changed: 0 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From kvn at openjdk.org Tue Nov 15 20:33:02 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 15 Nov 2022 20:33:02 GMT Subject: RFR: 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" In-Reply-To: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> References: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> Message-ID: On Tue, 15 Nov 2022 11:42:00 GMT, Roland Westrelin wrote: > This failure is similar to previous failures with loop strip mining: a > node is encountered that has control set in the outer strip mined loop > but is not reachable from the safepoint. There's already logic in loop > cloning to find those and fix their control to be outside the > loop. Usually a node ends up in the outer loop because some of its > inputs is in the outer loop. The current logic to catch nodes that are > erroneously assigned control in the outer loop is to start from > safepoint's inputs and look for uses with incorrect control. That > doesn't work in this case because: 1) the node is created by > IdealLoopTree::reassociate in the outer loop because its inputs are > indeed there 2) but a pass of split if updates the control to be > inside the inner loop. > > To fix this, I propose reusing the existing clone_outer_loop_helper() > but apply it to the loop body as well. I had to tweak that method > because I ran into cases of dead nodes still reachable from a node in > the loop body but removed from the _body list by > IdealLoopTree::DCE_loop_body() (and as a result not cloned). Make sense. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11162 From tsteele at openjdk.org Tue Nov 15 20:56:15 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Tue, 15 Nov 2022 20:56:15 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: References: Message-ID: On Sun, 6 Nov 2022 17:28:53 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). 
>> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. >> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains five commits: > > - Fix cpp condition and add PPC64 > - Changes lost in merge > - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 > - Use callers_sp for fsize calculation in recurse_freeze_interpreted_frame > - Loom ppc64le port These changes are significant. They appear well thought out and well executed. Thanks for submitting this PR. The failures are indeed time-dependant, and pass with JTREG="TIMEOUT_FACTOR=16". src/hotspot/share/oops/stackChunkOop.inline.hpp line 126: > 124: inline bool stackChunkOopDesc::is_empty() const { > 125: assert(sp() <= stack_size(), ""); > 126: assert((sp() == stack_size()) == (sp() >= stack_size() - argsize() - frame::metadata_words_at_top), Your change here looks good, but the assertion condition seems incorrect. If `(sp() == stack_size()) == false` and `(sp() >= stack_size() - argsize() - frame::metadata_words_at_top) == false`, then the assertion passes. Unless there is a case for this behaviour, I think it's safe to change this comparison to logical AND. ------------- Marked as reviewed by tsteele (Committer). PR: https://git.openjdk.org/jdk/pull/10961 From tsteele at openjdk.org Tue Nov 15 20:56:16 2022 From: tsteele at openjdk.org (Tyler Steele) Date: Tue, 15 Nov 2022 20:56:16 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v2] In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 09:36:04 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). >> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. >> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). 
Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Use callers_sp for fsize calculation in recurse_freeze_interpreted_frame src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 1694: > 1692: #ifdef ASSERT > 1693: __ load_const_optimized(tmp2, 0x1234); > 1694: __ stw(tmp2, in_bytes(ContinuationEntry::cookie_offset()), R1_SP); Would it be appropriate to call ContinuationEntry::cookie_value() here instead? ------------- PR: https://git.openjdk.org/jdk/pull/10961 From duke at openjdk.org Tue Nov 15 21:17:16 2022 From: duke at openjdk.org (Joshua Cao) Date: Tue, 15 Nov 2022 21:17:16 GMT Subject: Integrated: JDK-8296969: C1: PrintC1Statistics is broken after JDK-8292878 In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 15:06:05 GMT, Joshua Cao wrote: > Issue is coming from https://bugs.openjdk.org/browse/JDK-8292878. This PR adds the `rscratch1` argument to the affected `incrementl` call. This pull request has now been integrated. Changeset: 0cbf084b Author: Joshua Cao Committer: Xin Liu URL: https://git.openjdk.org/jdk/commit/0cbf084b44cbae1b879f4dd7847de0a551e5c1ea Stats: 52 lines in 2 files changed: 51 ins; 0 del; 1 mod 8296969: C1: PrintC1Statistics is broken after JDK-8292878 Reviewed-by: chagedorn, kvn, xliu ------------- PR: https://git.openjdk.org/jdk/pull/11170 From luhenry at openjdk.org Tue Nov 15 22:25:05 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Tue, 15 Nov 2022 22:25:05 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: <17xVJqTabeiuAn-pEdhiUcdfBuqknPJjqXcVW1eSdWE=.4af5a40b-237b-4ac7-8866-5292d5921754@github.com> On Wed, 9 Nov 2022 07:57:30 GMT, Yi-Fan Tsai wrote: > The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. 
> > Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. > > No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. > > Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. Marked as reviewed by luhenry (Author). ------------- PR: https://git.openjdk.org/jdk/pull/11054 From duke at openjdk.org Tue Nov 15 23:43:12 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Tue, 15 Nov 2022 23:43:12 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 [v2] In-Reply-To: References: Message-ID: > The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. > > Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. > > No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. > > Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. Yi-Fan Tsai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'openjdk:master' into JDK-8296548 - 8296548: Improve MD5 intrinsic for x86_64 The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. This change replaces LEA: r1 = r1 + rsi * 1 + t with ADDs: r1 += t; r1 += rsi. Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. Similar results can also be observed in TestMD5Intrinsics and TestMD5MultiBlockIntrinsics with a more moderate improvement, e.g. ~15% improvement in throughput on Haswell. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11054/files - new: https://git.openjdk.org/jdk/pull/11054/files/6ed4348c..be07b342 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11054&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11054&range=00-01 Stats: 11165 lines in 460 files changed: 4691 ins; 4515 del; 1959 mod Patch: https://git.openjdk.org/jdk/pull/11054.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11054/head:pull/11054 PR: https://git.openjdk.org/jdk/pull/11054 From vlivanov at openjdk.org Tue Nov 15 23:56:57 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 15 Nov 2022 23:56:57 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 17:58:36 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. 
>> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Vladimir's review > - live review with Sandhya > - jcheck > - Sandhya's review > - fix windows and 32b linux builds > - add getLimbs to interface and reviews > - fix 32-bit build > - make UsePolyIntrinsics option diagnostic > - Merge remote-tracking branch 'origin/master' into avx512-poly > - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 370: > 368: // Middle 44-bit limbs of new blocks > 369: __ vpsrlq(L1, L0, 44, Assembler::AVX_512bit); > 370: __ vpsllq(TMP2, TMP1, 20, Assembler::AVX_512bit); Any particular reason to use `TMP2` here? Can you just update `TMP1` instead (w/ `vpsllq(TMP1, TMP1, 20, Assembler::AVX_512bit);`)? ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Tue Nov 15 23:56:59 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 15 Nov 2022 23:56:59 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> Message-ID: <6ks_fjBAWGK7eqIki9sA9oWjTOheJR-JAakGUx5t6Ro=.df7278d3-5d28-4219-819f-74c73dfb0677@github.com> On Tue, 15 Nov 2022 17:42:08 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 384: >> >>> 382: void StubGenerator::poly1305_limbs(const Register limbs, const Register a0, const Register a1, const Register a2, bool only128) >>> 383: { >>> 384: const Register t1 = r13; >> >> Please, make the temps explicit and lift them into arguments. Otherwise, it's hard to see what registers are clobbered when helper methods are called. > > Thanks for pointing this out.. I spent quite a bit of time and went back and forth on 'register allocation'... it does make sense to pass all the temps needed, when the number of temps is small. This is the case for the three `*_limbs_*` functions. Maybe I should indeed do that... 
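The shifts quoted above from stubGenerator_x86_64_poly.cpp (vpsrlq by 44, vpsllq by 20) perform, per 64-bit lane, the standard split of a 128-bit message block into 44+44+40-bit limbs so products can be accumulated with lazy carries. A scalar sketch of that split, as I read the quoted snippet (the combine/mask step is inferred, and in the full algorithm the top limb also receives the 2^128 padding bit):

```c++
#include <cstdint>

// Split a 128-bit little-endian block (lo = bits 0..63, hi = bits 64..127)
// into three limbs of 44, 44 and 40 bits. This mirrors, one lane at a time,
// what the vpsrlq(.., 44) / vpsllq(.., 20) pair does across eight 64-bit lanes.
struct Limbs { uint64_t l0, l1, l2; };

inline Limbs split_44bit_limbs(uint64_t lo, uint64_t hi) {
  const uint64_t MASK44 = (1ULL << 44) - 1;
  Limbs out;
  out.l0 = lo & MASK44;                         // bits  0..43
  out.l1 = ((lo >> 44) | (hi << 20)) & MASK44;  // bits 44..87
  out.l2 = hi >> 24;                            // bits 88..127
  return out;
}
```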
> > On other hand, there are functions like `poly1305_multiply8_avx512` and `poly1305_process_blocks_avx512` that use a _lot_ of temp registers. I think it makes sense to keep those as 'function-header declarations'. > > Then there are functions like `poly1305_multiply_scalar` that could go either way, has some temps and 'implicitly clobbered' registers, but probably should stay 'as is'.. > > I ended up being 'pedantic' and making _all_ temps into 'header variables'. I also tried to comment, but those probably mean more to me then anyone else in hindsight? > > > // Register Map: > // GPRs: > // input = rdi > // length = rbx > // accumulator = rcx > // R = r8 > // a0 = rsi > // a1 = r9 > // a2 = r10 > // r0 = r11 > // r1 = r12 > // c1 = r8; > // t1 = r13 > // t2 = r14 > // t3 = r15 > // t0 = r14 > // rscratch = r13 > // stack(rsp, rbp) > // imul(rax, rdx) > // ZMMs: > // T: xmm0-6 > // C: xmm7-9 > // A: xmm13-18 > // B: xmm19-24 > // R: xmm25-29 > ... > // Register Map: > // reserved: rsp, rbp, rcx > // PARAMs: rdi, rbx, rsi, r8-r12 > // poly1305_multiply_scalar clobbers: r13-r15, rax, rdx > const Register t0 = r14; > const Register t1 = r13; > const Register rscratch = r13; > > // poly1305_limbs_avx512 clobbers: xmm0, xmm1 > // poly1305_multiply8_avx512 clobbers: xmm0-xmm6 > const XMMRegister T0 = xmm2; > ... > > > I think I am ok changing the `*limbs*` functions (even started, before I remembered my train of thought from months back..) but let me know if you agree with the rest of the reasoning? > On other hand, there are functions like poly1305_multiply8_avx512 and poly1305_process_blocks_avx512 that use a lot of temp registers. I think it makes sense to keep those as 'function-header declarations'. I agree with you on `poly1305_process_blocks_avx512`, but `poly1305_multiply8_avx512` already takes 8 arguments. Putting 8 more arguments for temps doesn't look prohibitive. > I think it makes sense to keep those as 'function-header declarations'. IMO it's not enough. Ideally, if there are any implicit usages, those should be clearly spelled out at every call site. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Tue Nov 15 23:57:00 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 15 Nov 2022 23:57:00 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: <9p2RTAI9FPWstQu0OtpSmSB7dqhFwmxbw86zZQg4GtU=.1be10660-ee5d-4654-9d4e-4fe3e449fd9b@github.com> References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> <-JVYIHKOY_LuVTqyH5xuubtPdk8pK_wi5z-8pestRis=.e63938ab-0ac2-4880-8238-e6e6d8debf03@github.com> <9p2RTAI9FPWstQu0OtpSmSB7dqhFwmxbw86zZQg4GtU=.1be10660-ee5d-4654-9d4e-4fe3e449fd9b@github.com> Message-ID: On Tue, 15 Nov 2022 19:38:26 GMT, Volodymyr Paprotski wrote: >> Ah, got it. Worth elaborating that in the comments. Otherwise, they confuse rather than help: >> >> // void processBlocks(byte[] input, int len, int[5] a, int[5] r) >> const Register input = rdi; //input+offset >> const Register length = rbx; >> const Register accumulator = rcx; >> const Register R = r8; > > Added a comment, hopefully less confusing. On a second thought, passing derived pointers as arguments doesn't mix well with safepoint awareness. (And this stub eventually has to become safepoint aware.) Deriving a pointer inside the stub from a base oop and offset is trivial, recovering base oop from derived pointer is hard. It doesn't mean we have to address it right now. 
------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Tue Nov 15 23:57:04 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 15 Nov 2022 23:57:04 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v17] In-Reply-To: <7hXP-vwxc6J7fklu8QuJqiIcSQRff-QyR1SZ0Fzfqmc=.33a38a51-38c3-451a-a756-ed538507f04e@github.com> References: <7hXP-vwxc6J7fklu8QuJqiIcSQRff-QyR1SZ0Fzfqmc=.33a38a51-38c3-451a-a756-ed538507f04e@github.com> Message-ID: On Tue, 15 Nov 2022 19:43:11 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 25 commits: > > - Vladimir's review comments > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Merge remote-tracking branch 'origin/master' into avx512-poly > - Vladimir's review > - live review with Sandhya > - jcheck > - Sandhya's review > - fix windows and 32b linux builds > - add getLimbs to interface and reviews > - fix 32-bit build > - ... and 15 more: https://git.openjdk.org/jdk/compare/7357a1a3...8f5942d9 src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 896: > 894: > 895: // Cleanup > 896: __ vpxorq(xmm0, xmm0, xmm0, Assembler::AVX_512bit); What's the purpose of the cleanup? src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 1004: > 1002: __ jcc(Assembler::less, L_process16Loop); > 1003: > 1004: poly1305_process_blocks_avx512(input, length, I'd like to see a comment here explaining what register effects are implicit. 
`poly1305_process_blocks_avx512` has the following comment, but it doesn't mention xmm registers: // Register Map: // reserved: rsp, rbp, rcx // PARAMs: rdi, rbx, rsi, r8-r12 // poly1305_multiply_scalar clobbers: r13-r15, rax, rdx ------------- PR: https://git.openjdk.org/jdk/pull/10582 From eastigeevich at openjdk.org Tue Nov 15 23:59:08 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 15 Nov 2022 23:59:08 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v11] In-Reply-To: <2D9ynUtu7IxcnyELEChKZf0zpksKpmAWZorKxVJlm40=.c9b41147-c5cf-48dd-a6af-d9c30d2705d6@github.com> References: <2D9ynUtu7IxcnyELEChKZf0zpksKpmAWZorKxVJlm40=.c9b41147-c5cf-48dd-a6af-d9c30d2705d6@github.com> Message-ID: On Tue, 15 Nov 2022 07:05:46 GMT, Boris Ulasevich wrote: >> The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. >> >> Testing: jtreg hotspot&jdk, Renaissance benchmarks > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > cleanup, rename src/hotspot/share/code/compressedStream.cpp line 193: > 191: if (_bit_position > 0) { > 192: grow_if_need(); > 193: _buffer[_position] = (b << (8 - _bit_position)); I see many `grow_if_need` in code. This means if we forget to call it, we can have an issue. What if we combine `grow_if_need` and `_buffer[_position] = ...` into one operation: void put(u_char b) { grow_if_need(); _buffer[_position] = b; } src/hotspot/share/code/compressedStream.hpp line 127: > 125: } > 126: > 127: u_char* buffer() const { return _buffer; } The function changes access to `_buffer` from `protected` to `public`. Should it be: const u_char* buffer() const; src/hotspot/share/code/compressedStream.hpp line 195: > 193: return _position; > 194: } > 195: void set_position(int pos) { Now I see why we have a problem to implement `position` and `set_position`. `position` originally had a meaning of the position where data would be written. Because of this it could be used to get the total amount of data written (see `DebugInformationRecorder::data_size`). It was also used to mark a position to roll back later (e.g. `DebugInformationRecorder::serialize_scope_values`). This violates the single-responsibility principle and makes difficult to add another implementation. To restore the principle we need separate functionalities from `position` and `set_position` into something like: // Mark the state of the stream. void mark(); // Roll the stream state back to the marked one. void roll_back(); // Return the amount of data the stream contains. int data_size(); We implement `mark` as creating copies of `_position`, `_bit_position` and `_buffer[_position]`. `roll_back` uses the copies to restore the state of the stream. `CompressedSparseDataWriteStream::data_size()` just returns `_position + 1`. There is the problem with `DebugInformationRecorder::find_sharable_decode_offset(int stream_offset)`. It calculates `stream_length` using `position()`. It depends too much on the current implementation. Because of this dependency we have to emulate it in our new implementation. 
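One possible shape for the mark/roll_back/data_size split proposed above, reduced to a byte-oriented toy so it stays self-contained; this is not the actual CompressedSparseDataWriteStream, and the bit-position bookkeeping of the real class is deliberately left out:

```c++
#include <vector>
#include <cstddef>

// A toy write stream that separates "how much data is there" (data_size)
// from "remember this point and possibly undo" (mark / roll_back),
// instead of exposing a raw set_position().
class ToyWriteStream {
 public:
  void put(unsigned char b) {        // grows on demand, so callers cannot
    _buffer.push_back(b);            // forget an explicit grow_if_need()
  }
  void mark() {                      // remember the current state
    _mark = _buffer.size();
    _marked = true;
  }
  void roll_back() {                 // drop everything written since mark()
    if (_marked) _buffer.resize(_mark);
  }
  std::size_t data_size() const {    // total amount of data written
    return _buffer.size();
  }
 private:
  std::vector<unsigned char> _buffer;
  std::size_t _mark = 0;
  bool _marked = false;
};
```

The point is that callers get "remember and undo" and "how much was written" as separate operations, rather than reusing a position()/set_position() pair for both.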
------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Tue Nov 15 23:59:12 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 15 Nov 2022 23:59:12 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: On Sun, 30 Oct 2022 17:07:58 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> minor renaming. adding encoding examples table > > src/hotspot/share/code/compressedStream.hpp line 135: > >> 133: CompressedSparseDataReadStream(u_char* buffer, int position) : CompressedBitStream(buffer, position) {} >> 134: >> 135: void set_position(int pos) { > > Are there uses of it? If no, let's remove it. Why is it marked resolved? ------------- PR: https://git.openjdk.org/jdk/pull/10025 From eastigeevich at openjdk.org Tue Nov 15 23:59:12 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 15 Nov 2022 23:59:12 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v6] In-Reply-To: References: Message-ID: <0D3aOfTLoyewYlYCiyVXLb8OxE1QktqumXaaXX2kdiQ=.86290507-f0f0-4321-aa98-3cb131dcef89@github.com> On Tue, 15 Nov 2022 06:57:09 GMT, Boris Ulasevich wrote: >> src/hotspot/share/code/compressedStream.hpp line 197: >> >>> 195: grow(); >>> 196: } >>> 197: _buffer[_position++] = b; >> >> I think `pos` must be `<= position()`. Should we check this? > > In fact, it is used as rollback. Though I do not want to limit functionality to this usage only. > Let me add `assert(_position < _size, "set_position is only used for rollback");` check here IMHO, the functionality of changing the position after the current position won't be correct. It breaks the invariant of a stream: all data written up to the current position must be correct. Anyway, at least we have an assertion. >> src/hotspot/share/code/debugInfo.hpp line 298: >> >>> 296: // debugging information. Used by ScopeDesc. >>> 297: >>> 298: class DebugInfoReadStream : public CompressedSparseDataReadStream { >> >> I don't think `DebugInfoReadStream`/`DebugInfoWriteStream` need public inheritance. The relation is more like composition. >> I would have implemented them like: >> >> class DebugInfoReadStream : private CompressedSparseDataReadStream { >> public: >> // we are using only needed functions from CompressedSparseDataReadStream. >> using CompressedSparseDataReadStream::buffer(); >> using CompressedSparseDataReadStream::read_int(); >> using ... >> }; >> >> Or >> >> template class DebugInfoReadStream { >> public: >> // define only needed functions which use a minimum number of functions from DataReadStream >> }; >> >> >> I prefer the templates because we can easily switch between different implementations of `DataReadStream`/DataWriteStream` without doing this kind of modifications. > > @No templates please! :) > For me, the following change is counterproductive as well. > > - class DebugInfoReadStream : private CompressedSparseDataReadStream { > + class DebugInfoReadStream : private CompressedSparseDataReadStream { > + public: > + using CompressedSparseDataReadStream::read_int; > + using CompressedSparseDataReadStream::read_signed_int; > + using CompressedSparseDataReadStream::read_double; > + using CompressedSparseDataReadStream::read_long; > + using CompressedSparseDataReadStream::read_bool; > + using CompressedSparseDataReadStream::CompressedSparseDataReadStream; Ok. I accept your arguments. 
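For reference, the private-inheritance-plus-using pattern debated in this exchange compiles as below when reduced to a free-standing example; the class and member names are placeholders, not the real DebugInfoReadStream or CompressedSparseDataReadStream:

```c++
// Stand-in for a concrete stream implementation.
class SparseReadStream {
 public:
  int  read_int()  { return 0; }
  bool read_bool() { return false; }
  void internal_helper() {}
};

// Only the operations the reader actually needs become part of its public
// interface; everything else from the base stays inaccessible.
class DebugReader : private SparseReadStream {
 public:
  using SparseReadStream::read_int;
  using SparseReadStream::read_bool;
};

int main() {
  DebugReader r;
  int v = r.read_int();       // OK: re-exported via using-declaration
  // r.internal_helper();     // would not compile: not exposed
  return v;
}
```

Only the re-exported members are callable through the derived class, which is the "restrict the surface without templates" effect under discussion.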
------------- PR: https://git.openjdk.org/jdk/pull/10025 From duke at openjdk.org Wed Nov 16 00:08:07 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 00:08:07 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v17] In-Reply-To: References: <7hXP-vwxc6J7fklu8QuJqiIcSQRff-QyR1SZ0Fzfqmc=.33a38a51-38c3-451a-a756-ed538507f04e@github.com> Message-ID: On Tue, 15 Nov 2022 19:41:25 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 25 commits: >> >> - Vladimir's review comments >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Vladimir's review >> - live review with Sandhya >> - jcheck >> - Sandhya's review >> - fix windows and 32b linux builds >> - add getLimbs to interface and reviews >> - fix 32-bit build >> - ... and 15 more: https://git.openjdk.org/jdk/compare/7357a1a3...8f5942d9 > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 896: > >> 894: >> 895: // Cleanup >> 896: __ vpxorq(xmm0, xmm0, xmm0, Assembler::AVX_512bit); > > What's the purpose of the cleanup? The internal security review asked me to blank out all the key material after I am done. i.e. R (and its powers on the stack) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From eastigeevich at openjdk.org Wed Nov 16 01:01:14 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 16 Nov 2022 01:01:14 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v11] In-Reply-To: References: <2D9ynUtu7IxcnyELEChKZf0zpksKpmAWZorKxVJlm40=.c9b41147-c5cf-48dd-a6af-d9c30d2705d6@github.com> Message-ID: On Tue, 15 Nov 2022 23:31:51 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup, rename > > src/hotspot/share/code/compressedStream.hpp line 195: > >> 193: return _position; >> 194: } >> 195: void set_position(int pos) { > > Now I see why we have a problem to implement `position` and `set_position`. > `position` originally had a meaning of the position where data would be written. Because of this it could be used to get the total amount of data written (see `DebugInformationRecorder::data_size`). > It was also used to mark a position to roll back later (e.g. `DebugInformationRecorder::serialize_scope_values`). > This violates the single-responsibility principle and makes difficult to add another implementation. > To restore the principle we need separate functionalities from `position` and `set_position` into something like: > > // Mark the state of the stream. > void mark(); > > // Roll the stream state back to the marked one. > void roll_back(); > > // Return the amount of data the stream contains. > int data_size(); > > > We implement `mark` as creating copies of `_position`, `_bit_position` and `_buffer[_position]`. `roll_back` uses the copies to restore the state of the stream. > `CompressedSparseDataWriteStream::data_size()` just returns `_position + 1`. > > There is the problem with `DebugInformationRecorder::find_sharable_decode_offset(int stream_offset)`. It calculates `stream_length` using `position()`. It depends too much on the current implementation. Because of this dependency we have to emulate it in our new implementation. I have an idea which might solve the issues. 
------------- PR: https://git.openjdk.org/jdk/pull/10025 From fgao at openjdk.org Wed Nov 16 02:06:36 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 16 Nov 2022 02:06:36 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: > For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. > > We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether > `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Clean up related code - Merge branch 'master' into fg8295407 - 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast For unsupported `CMove` patterns, JDK-8293833 helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. 
But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 [3] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L2701 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11034/files - new: https://git.openjdk.org/jdk/pull/11034/files/97a27264..bcf6a21e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11034&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11034&range=00-01 Stats: 11431 lines in 460 files changed: 4358 ins; 5241 del; 1832 mod Patch: https://git.openjdk.org/jdk/pull/11034.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11034/head:pull/11034 PR: https://git.openjdk.org/jdk/pull/11034 From fgao at openjdk.org Wed Nov 16 02:12:57 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 16 Nov 2022 02:12:57 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 06:03:42 GMT, Tobias Hartmann wrote: >> @TobiHartmann, thanks for your review! The options added here is commented as part of summary title, not as JVM options. I suppose it should be fine for a release build, right? :-) > > Right, I missed that. Does the test reproduce the issue without these flags? In any case, I think a more descriptive summary would be good. Yes, the added testcase here can reproduce the issue even without these flags and I suppose those flags help C2 vectorize more hot loops in the reported testcases. I updated the summary part in the new commit. Thanks for your suggestion @TobiHartmann! ------------- PR: https://git.openjdk.org/jdk/pull/11034 From fgao at openjdk.org Wed Nov 16 02:19:45 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 16 Nov 2022 02:19:45 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 02:06:36 GMT, Fei Gao wrote: >> For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. >> >> Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. 
As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. >> >> We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether >> `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. >> >> [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 >> [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Clean up related code > - Merge branch 'master' into fg8295407 > - 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast > > For unsupported `CMove` patterns, JDK-8293833 helps remove unused > `CMove` and related packs from superword candidate packset by the > function `remove_cmove_and_related_packs()`, but it only works > when `-XX:+UseVectorCmov` is enabled[1]. When the option is not > enabled, these unsupported `CMove` packs are still kept in the > superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter > out unsupported packs but it can't work as expected currently for > these `CMove` cases. As we know, not all `CMove` packs can be > vectorized. `merge_packs_to_cmovd()`[2] looks through all packs > in the superword packset and generates a `CMove` candidate > packset to collect all qualified `CMove` packs. Hence, only > `CMove` packs in the `CMove` candidate packset are our target > patterns and can be vectorized. But `filter_packs()` thinks, > if the `CMove` pack is in a superword packset and its vector > node is implemented in the current platform, then it can > be vectorized. Therefore, the function doesn't remove > these unsupported packs. > > We can adjust the function `implemented()` in the stage of > `filter_packs()` to check if the current `CMove` pack is in > the `CMove` candidate packset. If not, `filter_packs()` considers > it not to be vectorized and then remove it. After the fix, > whether `-XX:+UseVectorCmov` is enabled or not, these > unsupported packs can be removed by `filter_packs()`. In this > way, we don't need the function`remove_cmove_and_related_packs()` > anymore and thus the patch also cleans related code. 
> > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > [3] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L2701 > I have request since you are touching this code. [8192846](https://bugs.openjdk.org/browse/JDK-8192846) changes were a little sloppy and did not rename some methods which causing confusion. Please, rename `is_CmpD_candidate`, `merge_packs_to_cmpd` and others you find to general `fp` as you did with `is_cmove_fo_opcode`. > Or may be simply remove `D`: `is_Cmp_candidate(), merge_packs_to_cmove(), test_cmp_pack()`. There are also comments which describes only `CMoveD`. @vnkozlov, thanks for point it out. I cleaned up related code and comments in the new commit. Could you please help review it? Thanks! > Do we have IR tests to verify cmove vectorization? Yes, we do. I added [compiler/c2/irTests/TestVectorConditionalMove.java](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/c2/irTests/TestVectorConditionalMove.java) in [JDK-8289422](https://bugs.openjdk.org/browse/JDK-8289422). ------------- PR: https://git.openjdk.org/jdk/pull/11034 From kvn at openjdk.org Wed Nov 16 02:22:58 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 16 Nov 2022 02:22:58 GMT Subject: RFR: 8296548: Improve MD5 intrinsic for x86_64 [v2] In-Reply-To: References: Message-ID: <-u2SvidKHdlnJvslxVaY27GgjVxQyHnRBkZwiJ08nwo=.d0562bb7-51fa-45b9-a085-0cb69d666f2c@github.com> On Tue, 15 Nov 2022 23:43:12 GMT, Yi-Fan Tsai wrote: >> The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. >> >> This change replaces >> LEA: r1 = r1 + rsi * 1 + t >> with >> ADDs: r1 += t; r1 += rsi. >> >> Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. >> >> No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. >> >> Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. > > Yi-Fan Tsai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'openjdk:master' into JDK-8296548 > - 8296548: Improve MD5 intrinsic for x86_64 > > The LEA instruction loads the effective address, but MD5 intrinsic uses > it for computing values than addresses. This usage potentially uses > more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. > > Microbenchmark evaluation shows ~40% performance improvement on Haswell, > Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd > gen Epyc. > > No performance change for the same microbenchmark on Ice Lake and 3rd > gen Epyc. > > Similar results can also be observed in TestMD5Intrinsics and > TestMD5MultiBlockIntrinsics with a more moderate improvement, e.g. ~15% > improvement in throughput on Haswell. My testing passed. 
------------- PR: https://git.openjdk.org/jdk/pull/11054 From yyang at openjdk.org Wed Nov 16 02:31:00 2022 From: yyang at openjdk.org (Yi Yang) Date: Wed, 16 Nov 2022 02:31:00 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v3] In-Reply-To: References: <5ff-r2RgTNzao-sZ4D1kKWOPHWwzaCZxDDxyxl1Y0Us=.ae799d57-29ab-42c5-9908-a5811a8db0bc@github.com> <4iWrjtgXQfRvRYkT2_wUGAQkIouqwlng4IJmHyvCHqQ=.6b36faaa-eb22-4e32-bfc2-dfedd645eff2@github.com> <9AT3XEsyxvrRfTLGGGVCeANLLMDIcRyzKxjagsbYAoo=.e33159d7-c7a3-41bd-90db-4b01068adbeb@github.com> Message-ID: On Thu, 8 Sep 2022 07:19:10 GMT, Tobias Hartmann wrote: > > Why? Can you clarify more? > > As I mentioned above, I don't understand how your newly added condition is supposed to work. @vnkozlov @TobiHartmann I reviewed the above comments again. I think the proposed fix is good. According to my analysis at https://github.com/openjdk/jdk/pull/9695#issuecomment-1206221152 , we have a parallel IV which formed as `Add->CastII->Phi`, the birth of the form looks good, and this pattern is only legal after [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585). In that case, `Add->CastII->Phi` parallel IV was generated by Preconditions.checkIndex, CastII was controlled by ConstraintCast. In the proposed fix, I'm trying to avoid to recognize similar pattern if CastII was not controlled by ConstraintCast. I reviewed the previous discussion and I think this fix is ok. According to my analysis(https://github.com/openjdk/jdk/pull/9695#issuecomment-1206221152 ), the entire IR(Add-CastII-Phi) generation is reasonable, and this format is only valid after [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585). In that case, CastII is controlled by ConstraintCast, so in the current patch, we refuse to recognize other cases besides this. As far as this fix itself is concerned, I think the whole process is reasonable. The only thing I'm not sure about is whether this Add-CastII-Phi form is really reasonable as an IV, which is why I think adding an IR verification test for [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) is reasonable, I will go back to investigate that again after this patch, what do you think? ------------- PR: https://git.openjdk.org/jdk/pull/9695 From kvn at openjdk.org Wed Nov 16 02:40:19 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 16 Nov 2022 02:40:19 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 02:06:36 GMT, Fei Gao wrote: >> For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. >> >> Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. 
Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. >> >> We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether >> `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. >> >> [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 >> [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Clean up related code > - Merge branch 'master' into fg8295407 > - 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast > > For unsupported `CMove` patterns, JDK-8293833 helps remove unused > `CMove` and related packs from superword candidate packset by the > function `remove_cmove_and_related_packs()`, but it only works > when `-XX:+UseVectorCmov` is enabled[1]. When the option is not > enabled, these unsupported `CMove` packs are still kept in the > superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter > out unsupported packs but it can't work as expected currently for > these `CMove` cases. As we know, not all `CMove` packs can be > vectorized. `merge_packs_to_cmovd()`[2] looks through all packs > in the superword packset and generates a `CMove` candidate > packset to collect all qualified `CMove` packs. Hence, only > `CMove` packs in the `CMove` candidate packset are our target > patterns and can be vectorized. But `filter_packs()` thinks, > if the `CMove` pack is in a superword packset and its vector > node is implemented in the current platform, then it can > be vectorized. Therefore, the function doesn't remove > these unsupported packs. > > We can adjust the function `implemented()` in the stage of > `filter_packs()` to check if the current `CMove` pack is in > the `CMove` candidate packset. If not, `filter_packs()` considers > it not to be vectorized and then remove it. After the fix, > whether `-XX:+UseVectorCmov` is enabled or not, these > unsupported packs can be removed by `filter_packs()`. In this > way, we don't need the function`remove_cmove_and_related_packs()` > anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > [3] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L2701 Looks good. I submitted testing. 
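For context on what a CMove pack is: SuperWord looks for counted loops where a floating-point compare feeds a select, as in the sketch below, and only packs that end up in the CMove candidate set may legally become vector blends. The loop is a generic illustration (written in C++ rather than as one of the Java test cases):

```c++
#include <vector>
#include <cstddef>

// A compare feeding a select on floats: the per-element ternary is the shape
// that can be packed into a CMove and, when the pack survives filtering and
// the platform supports it, emitted as a single vector blend per iteration.
void elementwise_max(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& out) {
  // assumes a, b and out have the same length
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = (a[i] > b[i]) ? a[i] : b[i];
  }
}
```

When such a pack is not in the candidate set, the fix described above makes filter_packs() drop it instead of letting it reach code generation.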
------------- PR: https://git.openjdk.org/jdk/pull/11034 From thartmann at openjdk.org Wed Nov 16 06:06:32 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Nov 2022 06:06:32 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 02:06:36 GMT, Fei Gao wrote: >> For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. >> >> Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. >> >> We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether >> `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. >> >> [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 >> [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Clean up related code > - Merge branch 'master' into fg8295407 > - 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast > > For unsupported `CMove` patterns, JDK-8293833 helps remove unused > `CMove` and related packs from superword candidate packset by the > function `remove_cmove_and_related_packs()`, but it only works > when `-XX:+UseVectorCmov` is enabled[1]. When the option is not > enabled, these unsupported `CMove` packs are still kept in the > superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter > out unsupported packs but it can't work as expected currently for > these `CMove` cases. As we know, not all `CMove` packs can be > vectorized. `merge_packs_to_cmovd()`[2] looks through all packs > in the superword packset and generates a `CMove` candidate > packset to collect all qualified `CMove` packs. 
Hence, only > `CMove` packs in the `CMove` candidate packset are our target > patterns and can be vectorized. But `filter_packs()` thinks, > if the `CMove` pack is in a superword packset and its vector > node is implemented in the current platform, then it can > be vectorized. Therefore, the function doesn't remove > these unsupported packs. > > We can adjust the function `implemented()` in the stage of > `filter_packs()` to check if the current `CMove` pack is in > the `CMove` candidate packset. If not, `filter_packs()` considers > it not to be vectorized and then remove it. After the fix, > whether `-XX:+UseVectorCmov` is enabled or not, these > unsupported packs can be removed by `filter_packs()`. In this > way, we don't need the function`remove_cmove_and_related_packs()` > anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > [3] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L2701 Looks good to me. Let's wait for Vladimir's testing to finish. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11034 From thartmann at openjdk.org Wed Nov 16 06:06:34 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Nov 2022 06:06:34 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 02:08:33 GMT, Fei Gao wrote: >> Right, I missed that. Does the test reproduce the issue without these flags? In any case, I think a more descriptive summary would be good. > > Yes, the added testcase here can reproduce the issue even without these flags and I suppose those flags help C2 vectorize more hot loops in the reported testcases. I updated the summary part in the new commit. Thanks for your suggestion @TobiHartmann! Great, thanks for confirming and updating the summary! ------------- PR: https://git.openjdk.org/jdk/pull/11034 From duke at openjdk.org Wed Nov 16 06:16:55 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Wed, 16 Nov 2022 06:16:55 GMT Subject: Integrated: 8296548: Improve MD5 intrinsic for x86_64 In-Reply-To: References: Message-ID: <2AnMa7kndLRDNh88RgZxmTHEHLg6tdhxPJ3WYO8ARBE=.9f962915-8fb8-4907-be2b-9e7f530fa493@github.com> On Wed, 9 Nov 2022 07:57:30 GMT, Yi-Fan Tsai wrote: > The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. > > This change replaces > LEA: r1 = r1 + rsi * 1 + t > with > ADDs: r1 += t; r1 += rsi. > > Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. > > No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. > > Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake. This pull request has now been integrated. 
Changeset: 6ead2b01 Author: Yi-Fan Tsai Committer: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/6ead2b019595f9b54a70603da84f11271ee070b6 Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod 8296548: Improve MD5 intrinsic for x86_64 Reviewed-by: kvn, sviswanathan, luhenry ------------- PR: https://git.openjdk.org/jdk/pull/11054 From dongbo at openjdk.org Wed Nov 16 06:20:32 2022 From: dongbo at openjdk.org (Dong Bo) Date: Wed, 16 Nov 2022 06:20:32 GMT Subject: RFR: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 09:58:43 GMT, Andrew Haley wrote: > Hmm, okay. Looks like there's work to do on this. I'll approve this patch, but we really must get MacOS fixed for JDK 20. Thanks for the review. For MacOS issue, I have filed an issue, see https://bugs.openjdk.org/browse/JDK-8297092. Because we only have a limited testing environment for MacOS, I've asked @shqking for help, he is willing to fix it. We have two reviews for this SHA3 PR, and it has been fully tested on several real processors. I am going to integrate this, @theRealAph, would you like to sponsor this? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10939 From thartmann at openjdk.org Wed Nov 16 06:22:06 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Nov 2022 06:22:06 GMT Subject: RFR: 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" In-Reply-To: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> References: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> Message-ID: On Tue, 15 Nov 2022 11:42:00 GMT, Roland Westrelin wrote: > This failure is similar to previous failures with loop strip mining: a > node is encountered that has control set in the outer strip mined loop > but is not reachable from the safepoint. There's already logic in loop > cloning to find those and fix their control to be outside the > loop. Usually a node ends up in the outer loop because some of its > inputs is in the outer loop. The current logic to catch nodes that are > erroneously assigned control in the outer loop is to start from > safepoint's inputs and look for uses with incorrect control. That > doesn't work in this case because: 1) the node is created by > IdealLoopTree::reassociate in the outer loop because its inputs are > indeed there 2) but a pass of split if updates the control to be > inside the inner loop. > > To fix this, I propose reusing the existing clone_outer_loop_helper() > but apply it to the loop body as well. I had to tweak that method > because I ran into cases of dead nodes still reachable from a node in > the loop body but removed from the _body list by > IdealLoopTree::DCE_loop_body() (and as a result not cloned). Looks good to me otherwise. src/hotspot/share/opto/loopopts.cpp line 2304: > 2302: for (uint i = 0; i < loop->_body.size(); i++) { > 2303: Node* old = loop->_body.at(i); > 2304: clone_outer_loop_helper(old, loop, outer_loop, old_new, wq, this, true); While you're at it, could you rename the helper method to something more meaningful? 
test/hotspot/jtreg/compiler/loopstripmining/TestUseFromInnerInOuterUnusedBySfpt.java line 57: > 55: public static void main(String[] strArr) { > 56: TestUseFromInnerInOuterUnusedBySfpt _instance = new TestUseFromInnerInOuterUnusedBySfpt(); > 57: for (int i = 0; i < 10; i++ ) { Suggestion: for (int i = 0; i < 10; i++) { ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11162 From duke at openjdk.org Wed Nov 16 06:34:07 2022 From: duke at openjdk.org (zzambers) Date: Wed, 16 Nov 2022 06:34:07 GMT Subject: Integrated: 8295952: Problemlist existing compiler/rtm tests also on x86 In-Reply-To: References: Message-ID: <4IZfuq0OUU6CFZjPHEaPCy-CjhRsXJQyXLmA1rg4hgA=.af884463-64e4-44e9-945c-802114f34125@github.com> On Wed, 26 Oct 2022 16:43:26 GMT, zzambers wrote: > Problemlist should be extended so that existing compiler/rtm entries include x86 (32-bit) intel builds as well, as these are also affected. This pull request has now been integrated. Changeset: 3f2f128a Author: Zdenek Zambersky Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/3f2f128af6ec2f9097af7758bfd41aeaa4354d40 Stats: 11 lines in 1 file changed: 0 ins; 0 del; 11 mod 8295952: Problemlist existing compiler/rtm tests also on x86 Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/10875 From kvn at openjdk.org Wed Nov 16 07:34:38 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 16 Nov 2022 07:34:38 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 02:06:36 GMT, Fei Gao wrote: >> For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. >> >> Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. >> >> We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether >> `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. 
>> >> [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 >> [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Clean up related code > - Merge branch 'master' into fg8295407 > - 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast > > For unsupported `CMove` patterns, JDK-8293833 helps remove unused > `CMove` and related packs from superword candidate packset by the > function `remove_cmove_and_related_packs()`, but it only works > when `-XX:+UseVectorCmov` is enabled[1]. When the option is not > enabled, these unsupported `CMove` packs are still kept in the > superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter > out unsupported packs but it can't work as expected currently for > these `CMove` cases. As we know, not all `CMove` packs can be > vectorized. `merge_packs_to_cmovd()`[2] looks through all packs > in the superword packset and generates a `CMove` candidate > packset to collect all qualified `CMove` packs. Hence, only > `CMove` packs in the `CMove` candidate packset are our target > patterns and can be vectorized. But `filter_packs()` thinks, > if the `CMove` pack is in a superword packset and its vector > node is implemented in the current platform, then it can > be vectorized. Therefore, the function doesn't remove > these unsupported packs. > > We can adjust the function `implemented()` in the stage of > `filter_packs()` to check if the current `CMove` pack is in > the `CMove` candidate packset. If not, `filter_packs()` considers > it not to be vectorized and then remove it. After the fix, > whether `-XX:+UseVectorCmov` is enabled or not, these > unsupported packs can be removed by `filter_packs()`. In this > way, we don't need the function`remove_cmove_and_related_packs()` > anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > [3] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L2701 I got 2 failures. Description in the bug report. One could be not related. But Vector512ConversionTests.java test failure with flags from the bug report is suspicious. 
------------- PR: https://git.openjdk.org/jdk/pull/11034 From kvn at openjdk.org Wed Nov 16 07:42:55 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 16 Nov 2022 07:42:55 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 02:06:36 GMT, Fei Gao wrote: >> For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. >> >> Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. >> >> We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether >> `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. >> >> [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 >> [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Clean up related code > - Merge branch 'master' into fg8295407 > - 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast > > For unsupported `CMove` patterns, JDK-8293833 helps remove unused > `CMove` and related packs from superword candidate packset by the > function `remove_cmove_and_related_packs()`, but it only works > when `-XX:+UseVectorCmov` is enabled[1]. When the option is not > enabled, these unsupported `CMove` packs are still kept in the > superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter > out unsupported packs but it can't work as expected currently for > these `CMove` cases. As we know, not all `CMove` packs can be > vectorized. `merge_packs_to_cmovd()`[2] looks through all packs > in the superword packset and generates a `CMove` candidate > packset to collect all qualified `CMove` packs. 
Hence, only > `CMove` packs in the `CMove` candidate packset are our target > patterns and can be vectorized. But `filter_packs()` thinks, > if the `CMove` pack is in a superword packset and its vector > node is implemented in the current platform, then it can > be vectorized. Therefore, the function doesn't remove > these unsupported packs. > > We can adjust the function `implemented()` in the stage of > `filter_packs()` to check if the current `CMove` pack is in > the `CMove` candidate packset. If not, `filter_packs()` considers > it not to be vectorized and then remove it. After the fix, > whether `-XX:+UseVectorCmov` is enabled or not, these > unsupported packs can be removed by `filter_packs()`. In this > way, we don't need the function`remove_cmove_and_related_packs()` > anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > [3] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L2701 Digging in JBS shows that Vector512ConversionTests.java failure is most likely [JDK-8276064](https://bugs.openjdk.org/browse/JDK-8276064). So it is not new failure. Looks like your changes are fine. ------------- PR: https://git.openjdk.org/jdk/pull/11034 From kvn at openjdk.org Wed Nov 16 07:53:01 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 16 Nov 2022 07:53:01 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 02:06:36 GMT, Fei Gao wrote: >> For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. >> >> Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. >> >> We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether >> `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. 
>> >> [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 >> [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Clean up related code > - Merge branch 'master' into fg8295407 > - 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast > > For unsupported `CMove` patterns, JDK-8293833 helps remove unused > `CMove` and related packs from superword candidate packset by the > function `remove_cmove_and_related_packs()`, but it only works > when `-XX:+UseVectorCmov` is enabled[1]. When the option is not > enabled, these unsupported `CMove` packs are still kept in the > superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter > out unsupported packs but it can't work as expected currently for > these `CMove` cases. As we know, not all `CMove` packs can be > vectorized. `merge_packs_to_cmovd()`[2] looks through all packs > in the superword packset and generates a `CMove` candidate > packset to collect all qualified `CMove` packs. Hence, only > `CMove` packs in the `CMove` candidate packset are our target > patterns and can be vectorized. But `filter_packs()` thinks, > if the `CMove` pack is in a superword packset and its vector > node is implemented in the current platform, then it can > be vectorized. Therefore, the function doesn't remove > these unsupported packs. > > We can adjust the function `implemented()` in the stage of > `filter_packs()` to check if the current `CMove` pack is in > the `CMove` candidate packset. If not, `filter_packs()` considers > it not to be vectorized and then remove it. After the fix, > whether `-XX:+UseVectorCmov` is enabled or not, these > unsupported packs can be removed by `filter_packs()`. In this > way, we don't need the function`remove_cmove_and_related_packs()` > anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 > [3] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L2701 Marked as reviewed by kvn (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11034 From bulasevich at openjdk.org Wed Nov 16 07:54:28 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 16 Nov 2022 07:54:28 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v12] In-Reply-To: References: Message-ID: > The nmethod "scopes data" section is 10% of the size of nmethod. Now the data is compressed using the Pack200 algorithm, which is good for encoding small integers (LineNumberTable, etc). Using the fact that half of the data in the partition contains zeros, I reduce its size by another 30%. 
> > Testing: jtreg hotspot&jdk, Renaissance benchmarks Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: buffer() returns const array ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10025/files - new: https://git.openjdk.org/jdk/pull/10025/files/1135bac4..a24683d9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10025&range=10-11 Stats: 24 lines in 5 files changed: 10 ins; 4 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/10025.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10025/head:pull/10025 PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Wed Nov 16 07:54:30 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 16 Nov 2022 07:54:30 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v11] In-Reply-To: References: <2D9ynUtu7IxcnyELEChKZf0zpksKpmAWZorKxVJlm40=.c9b41147-c5cf-48dd-a6af-d9c30d2705d6@github.com> Message-ID: On Tue, 15 Nov 2022 22:03:34 GMT, Evgeny Astigeevich wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup, rename > > src/hotspot/share/code/compressedStream.hpp line 127: > >> 125: } >> 126: >> 127: u_char* buffer() const { return _buffer; } > > The function changes access to `_buffer` from `protected` to `public`. > Should it be: > > const u_char* buffer() const; Please check my change. - I move _buffer from CompressedSparseData to CompressedSparseDataReadStream and CompressedSparseDataWriteStream - the first one is constant. - I update debugInfoRec.cpp, nmethod.cpp, nmethod.hpp. On the one hand, I would like to avoid a lot of file chaining. On the other hand, the const modifier adds readability. Is that what you meant? ------------- PR: https://git.openjdk.org/jdk/pull/10025 From bulasevich at openjdk.org Wed Nov 16 07:54:32 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 16 Nov 2022 07:54:32 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v11] In-Reply-To: References: <2D9ynUtu7IxcnyELEChKZf0zpksKpmAWZorKxVJlm40=.c9b41147-c5cf-48dd-a6af-d9c30d2705d6@github.com> Message-ID: On Wed, 16 Nov 2022 00:57:02 GMT, Evgeny Astigeevich wrote: >> src/hotspot/share/code/compressedStream.hpp line 195: >> >>> 193: return _position; >>> 194: } >>> 195: void set_position(int pos) { >> >> Now I see why we have a problem to implement `position` and `set_position`. >> `position` originally had a meaning of the position where data would be written. Because of this it could be used to get the total amount of data written (see `DebugInformationRecorder::data_size`). >> It was also used to mark a position to roll back later (e.g. `DebugInformationRecorder::serialize_scope_values`). >> This violates the single-responsibility principle and makes difficult to add another implementation. >> To restore the principle we need separate functionalities from `position` and `set_position` into something like: >> >> // Mark the state of the stream. >> void mark(); >> >> // Roll the stream state back to the marked one. >> void roll_back(); >> >> // Return the amount of data the stream contains. >> int data_size(); >> >> >> We implement `mark` as creating copies of `_position`, `_bit_position` and `_buffer[_position]`. `roll_back` uses the copies to restore the state of the stream. 
>> `CompressedSparseDataWriteStream::data_size()` just returns `_position + 1`. >> >> There is the problem with `DebugInformationRecorder::find_sharable_decode_offset(int stream_offset)`. It calculates `stream_length` using `position()`. It depends too much on the current implementation. Because of this dependency we have to emulate it in our new implementation. > > I have an idea which might solve the issues. > // Roll the stream state back to the marked one. > void roll_back(); get_position() is not about roll back only. See the DebugInformationRecorder, it serializes offsets into the stream. int DebugInformationRecorder::serialize_scope_values(...) { ... int result = stream()->position(); ... return result; } DebugToken* DebugInformationRecorder::create_scope_values(...) { ... return (DebugToken*) (intptr_t) serialize_scope_values(values); } void PhaseOutput::Process_OopMap_Node(MachNode *mach, int current_offset) { ... DebugToken *locvals = C->debug_info()->create_scope_values(locarray); DebugToken *expvals = C->debug_info()->create_scope_values(exparray); DebugToken *monvals = C->debug_info()->create_monitor_values(monarray); C->debug_info()->describe_scope( ... locvals, expvals, monvals ); void DebugInformationRecorder::describe_scope(... DebugToken* locals, DebugToken* expressions, DebugToken* monitors) { ... // serialize the locals/expressions/monitors stream()->write_int((intptr_t) locals); stream()->write_int((intptr_t) expressions); stream()->write_int((intptr_t) monitors); ------------- PR: https://git.openjdk.org/jdk/pull/10025 From aturbanov at openjdk.org Wed Nov 16 08:45:21 2022 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Wed, 16 Nov 2022 08:45:21 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: References: Message-ID: On Sun, 6 Nov 2022 17:28:53 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). >> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. >> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. 
>> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Fix cpp condition and add PPC64 > - Changes lost in merge > - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 > - Use callers_sp for fsize calculation in recurse_freeze_interpreted_frame > - Loom ppc64le port test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 201: > 199: // compLevel = CompilerWhiteBoxTest.COMP_LEVEL_FULL_PROFILE; > 200: > 201: compPolicySelection = Integer.parseInt(args[0]); nit Suggestion: compPolicySelection = Integer.parseInt(args[0]); test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 341: > 339: > 340: public boolean shiftWindow() { > 341: if(compWindowMode == CompWindowMode.NO_COMP_WINDOW) return false; nit Suggestion: if (compWindowMode == CompWindowMode.NO_COMP_WINDOW) return false; test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 483: > 481: try { > 482: cont.run(); > 483: } catch(UnhandledException e) { nit Suggestion: } catch (UnhandledException e) { ------------- PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Wed Nov 16 10:06:04 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 16 Nov 2022 10:06:04 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: References: Message-ID: On Mon, 7 Nov 2022 22:15:15 GMT, Tyler Steele wrote: > Your change here looks good, but the assertion condition seems incorrect. If `(sp() == stack_size()) == false` and `(sp() >= stack_size() - argsize() - frame::metadata_words_at_top) == false`, then the assertion passes. 
Unless there is a case for this behaviour, I think it's safe to change this comparison to logical AND. That's intended. The assertion checks the equivalence of `sp() == stack_size()` and `sp() >= stack_size() - argsize() - frame::metadata_words_at_top` If the first predicate is false (meaning the stack is not empty) then the 2nd predicate must be false also. Note there is an alternative `is_empty()` method that uses the 2nd predicate: https://github.com/openjdk/jdk/blob/499406c764ba0ce57079b1f612297be5b148e5bb/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L437-L441 Maybe the following diagram of the stack in the StackChunk is useful: Offset: stack_size() =================== | | | Stack Arguments | | to Bottom Frame | Offset: stack_size() - argsize() |-----------------| | | | metadata at top | | | Offset: stack_size() - argsize() - frame::metadata_words_at_top =================== | Bottom Frame | | | | | =================== | | : : : FRAMES : : : | | =================== | Top Frame | | | | | |-----------------| | | | metadata at top | | | Offset: sp() =================== : : : : : Free Space : : : : : : : Offset: 0 ................... Offsets are relative to stackChunkOopDesc::start_of_stack() ------------- PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Wed Nov 16 10:14:16 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 16 Nov 2022 10:14:16 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v4] In-Reply-To: References: Message-ID: > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. > > The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. > > > X86 / AARCH64 PPC64: > > : : : : > : : : : > | | | | > |-----------------| |-----------------| > | | | | > | stack arguments | | stack arguments | > | |<- callers_SP | | > =================== |-----------------| > | | | | > | metadata at bottom | | metadata at top | > | | | |<- callers_SP > |-----------------| =================== > | | | | > | | | | > | | | | > | | | | > | |<- SP | | > =================== |-----------------| > | | > | metadata at top | > | |<- SP > =================== > > > On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. > > * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: > `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` > > * address of stack arguments: > `callers_SP + frame::metadata_words_at_top` > > * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. > > Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). 
Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. > > The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. > > Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. > > Thanks, Richard. Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Feedback from backwaterred ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10961/files - new: https://git.openjdk.org/jdk/pull/10961/files/c1d2f878..f42de6b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Wed Nov 16 10:14:16 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 16 Nov 2022 10:14:16 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 20:51:39 GMT, Tyler Steele wrote: > These changes are significant. They appear well thought out and well executed. Thanks for submitting this PR. Thanks for your review @backwaterred ------------- PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Wed Nov 16 10:14:17 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 16 Nov 2022 10:14:17 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v2] In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 17:50:16 GMT, Tyler Steele wrote: >> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: >> >> Use callers_sp for fsize calculation in recurse_freeze_interpreted_frame > > src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 1694: > >> 1692: #ifdef ASSERT >> 1693: __ load_const_optimized(tmp2, 0x1234); >> 1694: __ stw(tmp2, in_bytes(ContinuationEntry::cookie_offset()), R1_SP); > > Would it be appropriate to call ContinuationEntry::cookie_value() here instead? 
Done ------------- PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Wed Nov 16 10:24:24 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 16 Nov 2022 10:24:24 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v5] In-Reply-To: References: Message-ID: <7LIFD7nq9mWL6nfRJMz1pwc9h1SZaEnAWlsPT5mG1yI=.cedeccab-8642-4131-a90e-d8fc9d015619@github.com> > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. > > The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. > > > X86 / AARCH64 PPC64: > > : : : : > : : : : > | | | | > |-----------------| |-----------------| > | | | | > | stack arguments | | stack arguments | > | |<- callers_SP | | > =================== |-----------------| > | | | | > | metadata at bottom | | metadata at top | > | | | |<- callers_SP > |-----------------| =================== > | | | | > | | | | > | | | | > | | | | > | |<- SP | | > =================== |-----------------| > | | > | metadata at top | > | |<- SP > =================== > > > On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. > > * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: > `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` > > * address of stack arguments: > `callers_SP + frame::metadata_words_at_top` > > * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. > > Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. > > The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. 
> > Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. > > Thanks, Richard. Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Cleanup BasicExp.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10961/files - new: https://git.openjdk.org/jdk/pull/10961/files/f42de6b7..7276a8ec Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=03-04 Stats: 9 lines in 1 file changed: 5 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Wed Nov 16 10:24:30 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 16 Nov 2022 10:24:30 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v3] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 08:42:34 GMT, Andrey Turbanov wrote: >> Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: >> >> - Fix cpp condition and add PPC64 >> - Changes lost in merge >> - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 >> - Use callers_sp for fsize calculation in recurse_freeze_interpreted_frame >> - Loom ppc64le port > > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 201: > >> 199: // compLevel = CompilerWhiteBoxTest.COMP_LEVEL_FULL_PROFILE; >> 200: >> 201: compPolicySelection = Integer.parseInt(args[0]); > > nit > Suggestion: > > compPolicySelection = Integer.parseInt(args[0]); Done > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 341: > >> 339: >> 340: public boolean shiftWindow() { >> 341: if(compWindowMode == CompWindowMode.NO_COMP_WINDOW) return false; > > nit > Suggestion: > > if (compWindowMode == CompWindowMode.NO_COMP_WINDOW) return false; Done > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 483: > >> 481: try { >> 482: cont.run(); >> 483: } catch(UnhandledException e) { > > nit > Suggestion: > > } catch (UnhandledException e) { Done. Thanks for looking at the PR! ------------- PR: https://git.openjdk.org/jdk/pull/10961 From thartmann at openjdk.org Wed Nov 16 12:07:53 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Nov 2022 12:07:53 GMT Subject: RFR: 8296912: C2: CreateExNode::Identity fails with assert(i < _max) failed: oob: i=1, _max=1 Message-ID: The fix for [JDK-8284358](https://bugs.openjdk.org/browse/JDK-8284358) added code that aggressively removes dead subgraphs when detecting an unreachable Region by walking up the CFG and replacing all nodes by top (because they must be unreachable as well). In this case, we detect that `280 Region` is unreachable from root and replace `276 Catch` by top while walking up the CFG: ![Screenshot from 2022-11-16 12-53-56](https://user-images.githubusercontent.com/5312595/202174104-9437914a-cb38-401c-b0fd-d6ca849969b0.png) Code in `CreateExNode::Identity` does not expect `292 CatchProj` to have a top input when processing `305 CreateEx`. The fix is to simply add a `in(0)->in(0)->is_Catch()` check. 
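Put differently, before looking through the CatchProj the code now has to verify that its control input is still a Catch node. A minimal sketch of that guard (simplified, not the exact code in the patch):

    // In CreateExNode::Identity(), early out if the Catch has already been folded:
    Node* ctrl = in(0);
    if (ctrl == nullptr || !ctrl->is_CatchProj() ||
        ctrl->in(0) == nullptr || !ctrl->in(0)->is_Catch()) {
      return this; // the CatchProj's input was replaced by top, leave the CreateEx alone
    }
    // ... existing logic that inspects the Catch node continues here ...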
Thanks, Tobias ------------- Commit messages: - 8296912: C2: CreateExNode::Identity fails with assert(i < _max) failed: oob: i=1, _max=1 Changes: https://git.openjdk.org/jdk/pull/11181/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11181&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296912 Stats: 30 lines in 2 files changed: 23 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/11181.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11181/head:pull/11181 PR: https://git.openjdk.org/jdk/pull/11181 From chagedorn at openjdk.org Wed Nov 16 12:20:03 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Nov 2022 12:20:03 GMT Subject: RFR: 8296912: C2: CreateExNode::Identity fails with assert(i < _max) failed: oob: i=1, _max=1 In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 12:00:34 GMT, Tobias Hartmann wrote: > The fix for [JDK-8284358](https://bugs.openjdk.org/browse/JDK-8284358) added code that aggressively removes dead subgraphs when detecting an unreachable Region by walking up the CFG and replacing all nodes by top (because they must be unreachable as well). In this case, we detect that `280 Region` is unreachable from root and replace `276 Catch` by top while walking up the CFG: > > ![Screenshot from 2022-11-16 12-53-56](https://user-images.githubusercontent.com/5312595/202174104-9437914a-cb38-401c-b0fd-d6ca849969b0.png) > > Code in `CreateExNode::Identity` does not expect `292 CatchProj` to have a top input when processing `305 CreateEx`. The fix is to simply add a `in(0)->in(0)->is_Catch()` check. > > Thanks, > Tobias Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11181 From thartmann at openjdk.org Wed Nov 16 12:27:48 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Nov 2022 12:27:48 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v3] In-Reply-To: References: <5ff-r2RgTNzao-sZ4D1kKWOPHWwzaCZxDDxyxl1Y0Us=.ae799d57-29ab-42c5-9908-a5811a8db0bc@github.com> Message-ID: On Tue, 9 Aug 2022 12:51:33 GMT, Tobias Hartmann wrote: >> Yi Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> update comment > > src/hotspot/share/opto/loopnode.cpp line 3688: > >> 3686: if (incr2->in(1)->is_ConstraintCast() && !incr2->in(1)->in(0)->is_RangeCheck()) { >> 3687: // Skip AddI->CastII->Phi case if CastII is not controlled by local RangeCheck >> 3688: // to reflect changes in LibraryCallKit::inline_preconditions_checkIndex > > In the valid case, isn't the ConstraintCast control input `incr2->in(1)->in(0)` the IfTrue projection of the RangeCheck? > > I would remove the second line because it's not clear which "changes" in `LibraryCallKit::inline_preconditions_checkIndex` it is referring to. > > Suggestion: It's still not clear to me how the `incr2->in(1)->in(0)->is_RangeCheck()` condition can ever be true. How can the control input of a ConstraintCast be a RangeCheck? Shouldn't there be a projection node in-between? 
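To illustrate the shape I would expect here: a RangeCheck feeding an IfTrue projection which in turn feeds the CastII, so a test on the cast's control has to look through the projection first. Sketch only, where cast stands for incr2->in(1); this is not code from the patch:

    Node* ctrl = cast->in(0);                      // control input of the ConstraintCast
    bool guarded_by_range_check =
        ctrl != nullptr && ctrl->is_IfTrue() &&    // success projection of the check
        ctrl->in(0) != nullptr && ctrl->in(0)->is_RangeCheck();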
------------- PR: https://git.openjdk.org/jdk/pull/9695 From thartmann at openjdk.org Wed Nov 16 12:27:53 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Nov 2022 12:27:53 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v3] In-Reply-To: <5ff-r2RgTNzao-sZ4D1kKWOPHWwzaCZxDDxyxl1Y0Us=.ae799d57-29ab-42c5-9908-a5811a8db0bc@github.com> References: <5ff-r2RgTNzao-sZ4D1kKWOPHWwzaCZxDDxyxl1Y0Us=.ae799d57-29ab-42c5-9908-a5811a8db0bc@github.com> Message-ID: On Sat, 6 Aug 2022 07:42:18 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > update comment test/hotspot/jtreg/compiler/c2/TestUnexpectedParallelIV.java line 28: > 26: * @test > 27: * @bug 8290432 > 28: * @summary Unexpected parallel induction variable pattern was recongized This test does not reproduce the issue for me, whereas [Test-2.java](https://bugs.openjdk.org/secure/attachment/100710/Test-2.java) still works. ------------- PR: https://git.openjdk.org/jdk/pull/9695 From thartmann at openjdk.org Wed Nov 16 12:28:04 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Nov 2022 12:28:04 GMT Subject: RFR: 8296912: C2: CreateExNode::Identity fails with assert(i < _max) failed: oob: i=1, _max=1 In-Reply-To: References: Message-ID: <6lAOw4duFeNImfK0iJ91VfHFFir3EH-BjB1XmfUdP44=.1f34bc51-876e-4bfb-915f-8632b254ad40@github.com> On Wed, 16 Nov 2022 12:00:34 GMT, Tobias Hartmann wrote: > The fix for [JDK-8284358](https://bugs.openjdk.org/browse/JDK-8284358) added code that aggressively removes dead subgraphs when detecting an unreachable Region by walking up the CFG and replacing all nodes by top (because they must be unreachable as well). 
In this case, we detect that `280 Region` is unreachable from root and replace `276 Catch` by top while walking up the CFG: > > ![Screenshot from 2022-11-16 12-53-56](https://user-images.githubusercontent.com/5312595/202174104-9437914a-cb38-401c-b0fd-d6ca849969b0.png) > > Code in `CreateExNode::Identity` does not expect `292 CatchProj` to have a top input when processing `305 CreateEx`. The fix is to simply add a `in(0)->in(0)->is_Catch()` check. > > Thanks, > Tobias Thanks for the quick review, Christian! ------------- PR: https://git.openjdk.org/jdk/pull/11181 From thartmann at openjdk.org Wed Nov 16 12:39:26 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Nov 2022 12:39:26 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() [v3] In-Reply-To: References: Message-ID: <-UXLKKtATNoYaSZXMvkdAtsRhz4Sxk3MsriCiC1CHJA=.97481aa9-2540-409b-92d9-05e3c4cccae1@github.com> On Fri, 11 Nov 2022 14:56:05 GMT, Roman Kennke wrote: >> The code in C2_MacroAssembler::fast_unlock() has several (minor) issues: >> - The stack-locking path for x86_32 is not under UseHeavyMonitors - it would be executed even when stack-locking is disabled. >> - The stack-locking paths are the same for x86_32 and x86_64 - they can be merged into a common path. >> - In x86_32 path, we call get_thread(boxReg) which is totally bogus because we clear boxReg right afterwards with xorptr(boxReg, boxReg). >> - In x86_32 path, the CheckSucc label is identical to the DONE label, and in-fact CheckSucc is only ever really used in the x86_64 path and can be moved there. >> >> Testing: >> - [x] tier1 (x86_32, x86_64) >> - [x] tier2 (x86_32, x86_64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > Remove comments about DONE_LABEL being a hot target Looks reasonable to me but I'm not an expert in that code. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10936 From tholenstein at openjdk.org Wed Nov 16 13:52:58 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 16 Nov 2022 13:52:58 GMT Subject: RFR: JDK-8297047: IGV: graphContent not set when opening a new tab In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 16:47:36 GMT, Christian Hagedorn wrote: >> Open any graph in IGV. The graph will be opened in a new tab as expected. But the tab has the name "graph" instead of the actual graph name. Further, the "Bytecode" and "Control Flow" windows are not updated with the current graph. >> >> The reason was that `graphContent` was not set when opening a new EditorTopComponent. >> >> Before: >> ![graph_not_updated](https://user-images.githubusercontent.com/71546117/201946772-727f1c57-d69e-4551-a560-14d18cfb2b63.png) >> >> Now the title of tab and the "Control Flow" is updated: >> ![graph_updated](https://user-images.githubusercontent.com/71546117/201947659-a238d0a2-b064-4373-81dc-7fb3f0dea7ec.png) > > Looks good! Thanks @chhagedorn and @robcasloz for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/11168 From tholenstein at openjdk.org Wed Nov 16 13:56:35 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 16 Nov 2022 13:56:35 GMT Subject: Integrated: JDK-8297047: IGV: graphContent not set when opening a new tab In-Reply-To: References: Message-ID: <63SCUmAuCGO2rglgwoS7wyq-qDaNdOdwiR8FVwR1T1E=.b3870ab3-0b91-47ba-bf93-f73362fcbcdb@github.com> On Tue, 15 Nov 2022 14:38:57 GMT, Tobias Holenstein wrote: > Open any graph in IGV. 
The graph will be opened in a new tab as expected. But the tab has the name "graph" instead of the actual graph name. Further, the "Bytecode" and "Control Flow" windows are not updated with the current graph. > > The reason was that `graphContent` was not set when opening a new EditorTopComponent. > > Before: > ![graph_not_updated](https://user-images.githubusercontent.com/71546117/201946772-727f1c57-d69e-4551-a560-14d18cfb2b63.png) > > Now the title of tab and the "Control Flow" is updated: > ![graph_updated](https://user-images.githubusercontent.com/71546117/201947659-a238d0a2-b064-4373-81dc-7fb3f0dea7ec.png) This pull request has now been integrated. Changeset: 4946737f Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/4946737fcb581acaf2641d91c8db6728286ce29c Stats: 13 lines in 1 file changed: 8 ins; 4 del; 1 mod 8297047: IGV: graphContent not set when opening a new tab Reviewed-by: chagedorn, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/11168 From tholenstein at openjdk.org Wed Nov 16 15:05:22 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 16 Nov 2022 15:05:22 GMT Subject: RFR: JDK-8297032: IGV: shortcut to center selected nodes [v2] In-Reply-To: References: Message-ID: > Introduce a new shortcut `CTRL-9`/ `CMD-9` to center the nodes that are currently selected in IGV > > ![center_selected_nodes](https://user-images.githubusercontent.com/71546117/201934216-0b65caa2-af62-4083-877b-e5747d5409ee.png) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: make CenterSelectedNodesAction a ModelAwareAction ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11167/files - new: https://git.openjdk.org/jdk/pull/11167/files/692a90bb..1f275fbd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11167&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11167&range=00-01 Stats: 29 lines in 1 file changed: 4 ins; 16 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/11167.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11167/head:pull/11167 PR: https://git.openjdk.org/jdk/pull/11167 From tholenstein at openjdk.org Wed Nov 16 15:07:03 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 16 Nov 2022 15:07:03 GMT Subject: RFR: JDK-8297032: IGV: shortcut to center selected nodes [v2] In-Reply-To: References: Message-ID: <4UAHTE6nD_HOGlX4nfPK3LLDKn8y5ySYTDXlrTJO0kA=.eaabf7bd-7574-4df9-8bc3-7fb112d9012a@github.com> On Tue, 15 Nov 2022 18:27:50 GMT, Roberto Casta?eda Lozano wrote: > Looks good to me! A nit (up to you whether you want to address it in this PR): you might disable the action when no node is selected, similarly to the node extraction action. Thanks for the good input @robcasloz ! 
I addressed your suggestion by making `CenterSelectedNodesAction` a `ModelAwareAction` and overriding the `isEnabled()` function: The action is now disabled if no nodes are selected ------------- PR: https://git.openjdk.org/jdk/pull/11167 From phh at openjdk.org Wed Nov 16 18:10:03 2022 From: phh at openjdk.org (Paul Hohensee) Date: Wed, 16 Nov 2022 18:10:03 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() [v3] In-Reply-To: References: Message-ID: <_BEB6RS59lihs7ofAF9kT1bFsY7dTg11nlab5wZ1t1A=.b26b7aff-1e3e-41f3-9a86-28a882cf27c0@github.com> On Fri, 11 Nov 2022 14:56:05 GMT, Roman Kennke wrote: >> The code in C2_MacroAssembler::fast_unlock() has several (minor) issues: >> - The stack-locking path for x86_32 is not under UseHeavyMonitors - it would be executed even when stack-locking is disabled. >> - The stack-locking paths are the same for x86_32 and x86_64 - they can be merged into a common path. >> - In x86_32 path, we call get_thread(boxReg) which is totally bogus because we clear boxReg right afterwards with xorptr(boxReg, boxReg). >> - In x86_32 path, the CheckSucc label is identical to the DONE label, and in-fact CheckSucc is only ever really used in the x86_64 path and can be moved there. >> >> Testing: >> - [x] tier1 (x86_32, x86_64) >> - [x] tier2 (x86_32, x86_64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > Remove comments about DONE_LABEL being a hot target The changes net to no change to the executed code, so also lgtm. ------------- Marked as reviewed by phh (Reviewer). PR: https://git.openjdk.org/jdk/pull/10936 From rkennke at openjdk.org Wed Nov 16 18:19:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 16 Nov 2022 18:19:19 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() [v4] In-Reply-To: References: Message-ID: > The code in C2_MacroAssembler::fast_unlock() has several (minor) issues: > - The stack-locking path for x86_32 is not under UseHeavyMonitors - it would be executed even when stack-locking is disabled. > - The stack-locking paths are the same for x86_32 and x86_64 - they can be merged into a common path. > - In x86_32 path, we call get_thread(boxReg) which is totally bogus because we clear boxReg right afterwards with xorptr(boxReg, boxReg). > - In x86_32 path, the CheckSucc label is identical to the DONE label, and in-fact CheckSucc is only ever really used in the x86_64 path and can be moved there. > > Testing: > - [x] tier1 (x86_32, x86_64) > - [x] tier2 (x86_32, x86_64) Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: - Merge branch 'master' into JDK-8296170 - Remove comments about DONE_LABEL being a hot target - Merge remote-tracking branch 'upstream/master' into JDK-8296170 - 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10936/files - new: https://git.openjdk.org/jdk/pull/10936/files/153353ac..a75c36b2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10936&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10936&range=02-03 Stats: 8530 lines in 368 files changed: 5074 ins; 2001 del; 1455 mod Patch: https://git.openjdk.org/jdk/pull/10936.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10936/head:pull/10936 PR: https://git.openjdk.org/jdk/pull/10936 From rkennke at openjdk.org Wed Nov 16 18:19:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 16 Nov 2022 18:19:19 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() [v3] In-Reply-To: <_BEB6RS59lihs7ofAF9kT1bFsY7dTg11nlab5wZ1t1A=.b26b7aff-1e3e-41f3-9a86-28a882cf27c0@github.com> References: <_BEB6RS59lihs7ofAF9kT1bFsY7dTg11nlab5wZ1t1A=.b26b7aff-1e3e-41f3-9a86-28a882cf27c0@github.com> Message-ID: On Wed, 16 Nov 2022 18:06:10 GMT, Paul Hohensee wrote: > The changes net to no change to the executed code, so also lgtm. > > There is, however, a pre-submit linux-x86 tier1 test failure that should be resolved. Yes, that is a known issue. #11065 should fix it, I will pull in latest master and let GHA run again. ------------- PR: https://git.openjdk.org/jdk/pull/10936 From kvn at openjdk.org Wed Nov 16 18:53:55 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 16 Nov 2022 18:53:55 GMT Subject: RFR: 8296912: C2: CreateExNode::Identity fails with assert(i < _max) failed: oob: i=1, _max=1 In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 12:00:34 GMT, Tobias Hartmann wrote: > The fix for [JDK-8284358](https://bugs.openjdk.org/browse/JDK-8284358) added code that aggressively removes dead subgraphs when detecting an unreachable Region by walking up the CFG and replacing all nodes by top (because they must be unreachable as well). In this case, we detect that `280 Region` is unreachable from root and replace `276 Catch` by top while walking up the CFG: > > ![Screenshot from 2022-11-16 12-53-56](https://user-images.githubusercontent.com/5312595/202174104-9437914a-cb38-401c-b0fd-d6ca849969b0.png) > > Code in `CreateExNode::Identity` does not expect `292 CatchProj` to have a top input when processing `305 CreateEx`. The fix is to simply add a `in(0)->in(0)->is_Catch()` check. > > Thanks, > Tobias Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11181 From never at openjdk.org Wed Nov 16 19:18:03 2022 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 16 Nov 2022 19:18:03 GMT Subject: RFR: 8296958: [JVMCI] add API for retrieving ConstantValue attributes In-Reply-To: References: Message-ID: <5YIG5LDPNSi9SdJItRsNi6pT4DEUDPiP1bEdPxJ4qFo=.191c8df3-f140-40b9-a696-b0e00cac32f7@github.com> On Mon, 14 Nov 2022 20:22:00 GMT, Doug Simon wrote: > In order to properly initialize classes in a native image at run time, Native Image needs to capture the value of [`ConstantValue` attributes](https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-4.html#jvms-4.7.2) at image build time. This PR adds `ResolvedJavaField.getConstantValue()` for this purpose. 
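As an aside, a sketch of how an image builder might consume the new accessor. Only getConstantValue() itself comes from this PR; the null-when-absent behaviour and the surrounding loop are assumptions made for illustration:

    import jdk.vm.ci.meta.JavaConstant;
    import jdk.vm.ci.meta.ResolvedJavaField;
    import jdk.vm.ci.meta.ResolvedJavaType;

    static void recordConstantValues(ResolvedJavaType type) {
        for (ResolvedJavaField field : type.getStaticFields()) {
            JavaConstant cv = field.getConstantValue();   // value of the ConstantValue attribute
            if (cv != null) {                             // assumed: null when the attribute is absent
                System.out.println(field.format("%H.%n") + " = " + cv.toValueString());
            }
        }
    }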
Marked as reviewed by never (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11144 From duke at openjdk.org Wed Nov 16 19:32:04 2022 From: duke at openjdk.org (zzambers) Date: Wed, 16 Nov 2022 19:32:04 GMT Subject: RFR: 8295952: Problemlist existing compiler/rtm tests also on x86 In-Reply-To: References: Message-ID: On Wed, 26 Oct 2022 16:43:26 GMT, zzambers wrote: > Problemlist should be extended so that existing compiler/rtm entries include x86 (32-bit) intel builds as well, as these are also affected. @TobiHartmann @vnkozlov Thanks ------------- PR: https://git.openjdk.org/jdk/pull/10875 From never at openjdk.org Wed Nov 16 19:38:03 2022 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 16 Nov 2022 19:38:03 GMT Subject: RFR: 8292961: [JVMCI] Access to j.l.r.Method/Constructor/Field for ResolvedJavaMethod/ResolvedJavaField In-Reply-To: References: Message-ID: <_GeZwFDBOVywKFFbQaLFc4IYb8ZLcKYK_bZRZd6snoI=.f4bdc0b5-8c8b-44b5-abbc-72d725ed9805@github.com> On Mon, 14 Nov 2022 20:37:37 GMT, Doug Simon wrote: > Native Image needs to convert `ResolvedJavaMethod` objects to `java.lang.reflect.Executable` objects and `ResolvedJavaField` objects to `java.lang.reflect.Field` objects. This is currently done by digging into JVMCI internals with reflection. Instead, this functionality should be exposed by public JVMCI API which is what this PR does. Marked as reviewed by never (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11146 From never at openjdk.org Wed Nov 16 19:38:05 2022 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 16 Nov 2022 19:38:05 GMT Subject: RFR: 8296960: [JVMCI] list HotSpotConstantPool.loadReferencedType to ConstantPool In-Reply-To: References: Message-ID: <4gY0e73l3MiwbDQVJXmEnCR4QD8mH0UlMBu0BUpZPtI=.5b41c4c1-1fb9-4125-bbb1-374f8d883552@github.com> On Mon, 14 Nov 2022 20:30:28 GMT, Doug Simon wrote: > `HotSpotConstantPool.loadReferencedType(int cpi, int opcode, boolean initialize)` allows loading a type without triggering class initialization. This PR lifts this method up to `ConstantPool` so that this functionality can be used without depending on HotSpot-specific JVMCI classes. Marked as reviewed by never (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11145 From never at openjdk.org Wed Nov 16 19:43:00 2022 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 16 Nov 2022 19:43:00 GMT Subject: RFR: 8296967: [JVMCI] rationalize relationship between getCodeSize and getCode in ResolvedJavaMethod In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 21:03:19 GMT, Doug Simon wrote: > When `ResolvedJavaMethod.getCodeSize()` returns a value > 0, `ResolvedJavaMethod.getCode()` will return `null` if the declaring class is not linked, contrary to the intuition of most JVMCI API users. > This PR rationalizes the API such that: > > ResolvedJavaMethod m = ...; > ResolvedJavaType c = m.getDeclaringClass(); > > assert (m.getCodeSize() > 0) == (m.getCode() != null); // m is a non-abstract, non-native method whose declaring class is linked in the current runtime > assert (m.getCodeSize() == 0) == (m.getCode() == null); // m is an abstract or native method > assert c.isLinked() == (m.getCodeSize() >= 0); // m's code size will always be >= 0 if its declaring class is linked in the current runtime Marked as reviewed by never (Reviewer). 
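A small client-side sketch of the contract spelled out above. The link() call on the declaring class is an assumption (to make the class linked before asking for code); everything else only uses methods named in the quoted text:

    import jdk.vm.ci.meta.ResolvedJavaMethod;
    import jdk.vm.ci.meta.ResolvedJavaType;

    static byte[] bytecodeOrNull(ResolvedJavaMethod m) {
        ResolvedJavaType holder = m.getDeclaringClass();
        if (!holder.isLinked()) {
            holder.link();                                // assumed available on ResolvedJavaType
        }
        return m.getCodeSize() > 0 ? m.getCode() : null;  // null for abstract and native methods
    }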
------------- PR: https://git.openjdk.org/jdk/pull/11147 From dnsimon at openjdk.org Wed Nov 16 19:59:13 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 16 Nov 2022 19:59:13 GMT Subject: Integrated: 8296958: [JVMCI] add API for retrieving ConstantValue attributes In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 20:22:00 GMT, Doug Simon wrote: > In order to properly initialize classes in a native image at run time, Native Image needs to capture the value of [`ConstantValue` attributes](https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-4.html#jvms-4.7.2) at image build time. This PR adds `ResolvedJavaField.getConstantValue()` for this purpose. This pull request has now been integrated. Changeset: 4ce4f384 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/4ce4f384d720ab66ffde898c48d95a58039b0080 Stats: 204 lines in 9 files changed: 198 ins; 5 del; 1 mod 8296958: [JVMCI] add API for retrieving ConstantValue attributes Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/11144 From dnsimon at openjdk.org Wed Nov 16 20:01:39 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 16 Nov 2022 20:01:39 GMT Subject: Integrated: 8296961: [JVMCI] Access to j.l.r.Method/Constructor/Field for ResolvedJavaMethod/ResolvedJavaField In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 20:37:37 GMT, Doug Simon wrote: > Native Image needs to convert `ResolvedJavaMethod` objects to `java.lang.reflect.Executable` objects and `ResolvedJavaField` objects to `java.lang.reflect.Field` objects. This is currently done by digging into JVMCI internals with reflection. Instead, this functionality should be exposed by public JVMCI API which is what this PR does. This pull request has now been integrated. Changeset: 5db1b58c Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/5db1b58c867608573a9e7cc57ca2ba22c9dd80d4 Stats: 37 lines in 2 files changed: 34 ins; 0 del; 3 mod 8296961: [JVMCI] Access to j.l.r.Method/Constructor/Field for ResolvedJavaMethod/ResolvedJavaField Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/11146 From dnsimon at openjdk.org Wed Nov 16 20:28:33 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 16 Nov 2022 20:28:33 GMT Subject: Integrated: 8296960: [JVMCI] list HotSpotConstantPool.loadReferencedType to ConstantPool In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 20:30:28 GMT, Doug Simon wrote: > `HotSpotConstantPool.loadReferencedType(int cpi, int opcode, boolean initialize)` allows loading a type without triggering class initialization. This PR lifts this method up to `ConstantPool` so that this functionality can be used without depending on HotSpot-specific JVMCI classes. This pull request has now been integrated. 
Changeset: b3ef3375 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/b3ef337566c2cf78de1f636e039c799a1bfcb17e Stats: 19 lines in 2 files changed: 19 ins; 0 del; 0 mod 8296960: [JVMCI] list HotSpotConstantPool.loadReferencedType to ConstantPool Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/11145 From dnsimon at openjdk.org Wed Nov 16 20:30:06 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 16 Nov 2022 20:30:06 GMT Subject: Integrated: 8296967: [JVMCI] rationalize relationship between getCodeSize and getCode in ResolvedJavaMethod In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 21:03:19 GMT, Doug Simon wrote: > When `ResolvedJavaMethod.getCodeSize()` returns a value > 0, `ResolvedJavaMethod.getCode()` will return `null` if the declaring class is not linked, contrary to the intuition of most JVMCI API users. > This PR rationalizes the API such that: > > ResolvedJavaMethod m = ...; > ResolvedJavaType c = m.getDeclaringClass(); > > assert (m.getCodeSize() > 0) == (m.getCode() != null); // m is a non-abstract, non-native method whose declaring class is linked in the current runtime > assert (m.getCodeSize() == 0) == (m.getCode() == null); // m is an abstract or native method > assert c.isLinked() == (m.getCodeSize() >= 0); // m's code size will always be >= 0 if its declaring class is linked in the current runtime This pull request has now been integrated. Changeset: 37848a9c Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/37848a9ca2ab3021e7b3b2e112bab4631fbe1d99 Stats: 317 lines in 8 files changed: 232 ins; 38 del; 47 mod 8296967: [JVMCI] rationalize relationship between getCodeSize and getCode in ResolvedJavaMethod Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/11147 From duke at openjdk.org Wed Nov 16 20:52:14 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 20:52:14 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 
56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: redo register alloc with explicit func params ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/cbf49380..dbdfd1dc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=19 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=18-19 Stats: 387 lines in 2 files changed: 83 ins; 51 del; 253 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 16 21:12:26 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 21:12:26 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 19:30:23 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: >> >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Vladimir's review >> - live review with Sandhya >> - jcheck >> - Sandhya's review >> - fix windows and 32b linux builds >> - add getLimbs to interface and reviews >> - fix 32-bit build >> - make UsePolyIntrinsics option diagnostic >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - ... and 13 more: https://git.openjdk.org/jdk/compare/e269dc03...a26ac7db > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 370: > >> 368: // Middle 44-bit limbs of new blocks >> 369: __ vpsrlq(L1, L0, 44, Assembler::AVX_512bit); >> 370: __ vpsllq(TMP2, TMP1, 20, Assembler::AVX_512bit); > > Any particular reason to use `TMP2` here? Can you just update `TMP1` instead (w/ `vpsllq(TMP1, TMP1, 20, Assembler::AVX_512bit);`)? Thanks for the catch. Removed TMP2. (Several refactors ago, `D[01]` and `L[0-2]` used the same registers, because I was running out.. likely forgot to cleanup after I removed 2/3 of the optimizations and re-did register allocation) done ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 16 21:34:22 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 21:34:22 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> Message-ID: <62fFZ_M2aZHxrUV73RXbAVDUIGtpMOUFL4HdjqLqFJI=.392ad4b7-1917-49c6-b36b-178beee57102@github.com> On Tue, 15 Nov 2022 19:38:56 GMT, Volodymyr Paprotski wrote: >>> On other hand, there are functions like poly1305_multiply8_avx512 and poly1305_process_blocks_avx512 that use a lot of temp registers. I think it makes sense to keep those as 'function-header declarations'. >> >> I agree with you on `poly1305_process_blocks_avx512`, but `poly1305_multiply8_avx512` already takes 8 arguments. Putting 8 more arguments for temps doesn't look prohibitive. >> >>> I think it makes sense to keep those as 'function-header declarations'. >> >> IMO it's not enough. Ideally, if there are any implicit usages, those should be clearly spelled out at every call site. > > Changed just the three `*limbs*` functions. Lifted everything pretty much to just `poly1305_process_blocks_avx512` and `generate_poly1305_processBlocks` (i.e. 
two register maps) Took some time to make it 'reasonable' again, but I think it makes sense. (But then, true test would be me looking a month later or if it makes sense to others) Had to cleanup the names; 'local' names could all be play on `tmp`.. but the register reuse is much clearer from the 'global' names. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 16 21:34:22 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 21:34:22 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v16] In-Reply-To: References: <6oNkr_1EGAdRQqa7GDrsa-tIpV_kO-_HJAjdA8Mkf28=.34da74e0-c6f1-4eec-bc45-8c8dd02f68f0@github.com> <-JVYIHKOY_LuVTqyH5xuubtPdk8pK_wi5z-8pestRis=.e63938ab-0ac2-4880-8238-e6e6d8debf03@github.com> <9p2RTAI9FPWstQu0OtpSmSB7dqhFwmxbw86zZQg4GtU=.1be10660-ee5d-4654-9d4e-4fe3e449fd9b@github.com> Message-ID: <9Mqr0y4WvsMWZ0WjKVW1gJPZ7tJAuVmZcEVGXNTU8uU=.4d4ffb1c-2813-4567-8495-1d8452079a79@github.com> On Tue, 15 Nov 2022 23:51:22 GMT, Vladimir Ivanov wrote: >> Added a comment, hopefully less confusing. > > On a second thought, passing derived pointers as arguments doesn't mix well with safepoint awareness. > (And this stub eventually has to become safepoint aware.) > Deriving a pointer inside the stub from a base oop and offset is trivial, recovering base oop from derived pointer is hard. > > It doesn't mean we have to address it right now. Left it as is. I also postponed Bytebuffer support for now, for a separate PR.. we can also fix it then? ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 16 21:34:26 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 21:34:26 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v17] In-Reply-To: References: <7hXP-vwxc6J7fklu8QuJqiIcSQRff-QyR1SZ0Fzfqmc=.33a38a51-38c3-451a-a756-ed538507f04e@github.com> Message-ID: On Tue, 15 Nov 2022 19:44:16 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 25 commits: >> >> - Vladimir's review comments >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Merge remote-tracking branch 'origin/master' into avx512-poly >> - Vladimir's review >> - live review with Sandhya >> - jcheck >> - Sandhya's review >> - fix windows and 32b linux builds >> - add getLimbs to interface and reviews >> - fix 32-bit build >> - ... and 15 more: https://git.openjdk.org/jdk/compare/7357a1a3...8f5942d9 > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 1004: > >> 1002: __ jcc(Assembler::less, L_process16Loop); >> 1003: >> 1004: poly1305_process_blocks_avx512(input, length, > > I'd like to see a comment here explaining what register effects are implicit. > > `poly1305_process_blocks_avx512` has the following comment, but it doesn't mention xmm registers: > > // Register Map: > // reserved: rsp, rbp, rcx > // PARAMs: rdi, rbx, rsi, r8-r12 > // poly1305_multiply_scalar clobbers: r13-r15, rax, rdx Just redid the register allocation, comments, names, function parameters.. 
hope its better ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 16 22:41:13 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 22:41:13 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v14] In-Reply-To: References: Message-ID: <_Syg8xU1sH0-zQuv70r8OKuK9KwSIHvl4GzdJF_Gy9s=.247f2738-5d1e-4bf9-ac1e-93034d446f7b@github.com> On Fri, 11 Nov 2022 01:43:46 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> live review with Sandhya > > Overall, it looks good. @iwanowww Answered your review comments, please take a look again? Thanks again! ------------- PR: https://git.openjdk.org/jdk/pull/10582 From sviswanathan at openjdk.org Wed Nov 16 22:50:31 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 16 Nov 2022 22:50:31 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 20:52:14 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > redo register alloc with explicit func params src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 917: > 915: // Cleanup > 916: __ vpxorq(xmm0, xmm0, xmm0, Assembler::AVX_512bit); > 917: __ vpxorq(xmm1, xmm1, xmm1, Assembler::AVX_512bit); You could use T0, T1 in place of xmm0, xmm1 here. 
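Concretely, assuming T0/T1 are the stub's register-map aliases for xmm0/xmm1, the suggested change amounts to:

    // Cleanup
    __ vpxorq(T0, T0, T0, Assembler::AVX_512bit);
    __ vpxorq(T1, T1, T1, Assembler::AVX_512bit);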
------------- PR: https://git.openjdk.org/jdk/pull/10582 From eastigeevich at openjdk.org Wed Nov 16 22:55:13 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 16 Nov 2022 22:55:13 GMT Subject: RFR: 8293170: Improve encoding of the debuginfo nmethod section [v11] In-Reply-To: References: <2D9ynUtu7IxcnyELEChKZf0zpksKpmAWZorKxVJlm40=.c9b41147-c5cf-48dd-a6af-d9c30d2705d6@github.com> Message-ID: On Wed, 16 Nov 2022 07:48:42 GMT, Boris Ulasevich wrote: >> I have an idea which might solve the issues. > >> // Roll the stream state back to the marked one. >> void roll_back(); > > get_position() is not about roll back only. See the DebugInformationRecorder, it serializes offsets into the stream. > > > int DebugInformationRecorder::serialize_scope_values(...) { > ... > int result = stream()->position(); > ... > return result; > } > > DebugToken* DebugInformationRecorder::create_scope_values(...) { > ... > return (DebugToken*) (intptr_t) serialize_scope_values(values); > } > > void PhaseOutput::Process_OopMap_Node(MachNode *mach, int current_offset) { > ... > DebugToken *locvals = C->debug_info()->create_scope_values(locarray); > DebugToken *expvals = C->debug_info()->create_scope_values(exparray); > DebugToken *monvals = C->debug_info()->create_monitor_values(monarray); > > C->debug_info()->describe_scope( > ... > locvals, > expvals, > monvals > ); > > void DebugInformationRecorder::describe_scope(... > DebugToken* locals, > DebugToken* expressions, > DebugToken* monitors) { > ... > // serialize the locals/expressions/monitors > stream()->write_int((intptr_t) locals); > stream()->write_int((intptr_t) expressions); > stream()->write_int((intptr_t) monitors); Thank you for the information. It's very helpful. I think we should not simulate `CompressedWriteStream`. `DebugInformationRecorder` needs certain operations: We write debug info into a stream writer: as grouped multiple data and single data. We need to know where bytes of grouped data begin and end. We need to keep offsets of grouped data in the stream. We need to be able to discard last written grouped data. We need to get the number of used bytes. We don't need to know how data stored in a stream. Based on the specification, we need a stream writer to provide operations: // Start grouped data. // Return a position (byte offset) in the stream where grouped data begins. int start_group(); // Finish grouped data. // Return a position (byte offset) in the stream where grouped data ends. int finish_group(); // Revert the stream to the specified position. void set_position(int pos); // Return the number of bytes stored data uses. int data_size() const; With them we don't have a function which in one implementation is const but in another implementation is with side effects. IMHO, at some point later side effects will cause bugs. Possible implementations: int start_group() { complete_current_byte(); // this is renamed align() return _position; } int finish_group() { complete_current_byte(); return _position; } int data_size() const { if (_position == 0 && _bit_position == 0) return 0; int used_bytes = _position; if (_bit_position != 0) ++used_bytes; return used_bytes; } In `DebugInformationRecorder` we will need to replace `position()` with `start_group()` and add `finish_group()`. We will need to change `int DebugInformationRecorder::find_sharable_decode_offset(int stream_offset)` to `int DebugInformationRecorder::find_sharable_decode_offset(int data_begin_offset, int data_end_offset)`. 
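A rough sketch of what the recorder side could look like on top of these operations (proposal only; the actual sharing and recording logic is elided):

    int DebugInformationRecorder::serialize_scope_values(GrowableArray<ScopeValue*>* values) {
      int data_begin = stream()->start_group();   // byte-aligned offset where this group starts
      // ... write the scope values into the stream as today ...
      int data_end = stream()->finish_group();    // byte-aligned offset just past the group
      // share identical groups by comparing [data_begin, data_end) against earlier ones
      return find_sharable_decode_offset(data_begin, data_end);
    }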
If we want `DebugInformationRecorder` to use `CompressedWriteStream` we can use an adapter: class CompressedWriteStreamAdapter: public CompressedWriteStream { public: ... int start_group() { return position(); } int finish_group() { return position(); } int data_size() const { return position(); } }; ------------- PR: https://git.openjdk.org/jdk/pull/10025 From vlivanov at openjdk.org Wed Nov 16 23:10:31 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 16 Nov 2022 23:10:31 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: References: Message-ID: <9sCYEHe6Q8oPYHxAOWE4DjPGFalX0TIpaXkyWaGSyGk=.eb1e9867-8b83-4fed-a809-8c871cda8a23@github.com> On Wed, 16 Nov 2022 22:47:37 GMT, Sandhya Viswanathan wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> redo register alloc with explicit func params > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 917: > >> 915: // Cleanup >> 916: __ vpxorq(xmm0, xmm0, xmm0, Assembler::AVX_512bit); >> 917: __ vpxorq(xmm1, xmm1, xmm1, Assembler::AVX_512bit); > > You could use T0, T1 in place of xmm0, xmm1 here. Or simply switch to `vzeroall` for `xmm0` - `xmm15`. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Wed Nov 16 23:19:15 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 16 Nov 2022 23:19:15 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: References: Message-ID: <8UaOUsGNlGnh87OM1y8tMC6pVfeFn0nYUHyhvT7J-ss=.4e581e46-4759-4f69-8a6a-6383bc6f16de@github.com> On Wed, 16 Nov 2022 20:52:14 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 
56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > redo register alloc with explicit func params src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 756: > 754: > 755: // Store R^8-R for later use > 756: __ evmovdquq(Address(rsp, 64*0), B0, Assembler::AVX_512bit); Could these vector spills be eliminated? I counted 8 spare zmm registers available across the vector loop (xmm7-xmm12, xmm30, xmm31). And here's what is explicitly used in `process256Loop`: D0 D1 = xmm2-xmm3 B0 B1 B2 B3 B4 B5 = xmm19-xmm24 TMP = xmm6 A0 A1 A2 A3 A4 A5 = xmm13-xmm18 R0 R1 R2 R1P R2P = xmm25-xmm29 T0 T1 T2 T3 T4 T5 = xmm0-xmm5 ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 16 23:19:16 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 23:19:16 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: <8UaOUsGNlGnh87OM1y8tMC6pVfeFn0nYUHyhvT7J-ss=.4e581e46-4759-4f69-8a6a-6383bc6f16de@github.com> References: <8UaOUsGNlGnh87OM1y8tMC6pVfeFn0nYUHyhvT7J-ss=.4e581e46-4759-4f69-8a6a-6383bc6f16de@github.com> Message-ID: On Wed, 16 Nov 2022 23:12:28 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> redo register alloc with explicit func params > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 756: > >> 754: >> 755: // Store R^8-R for later use >> 756: __ evmovdquq(Address(rsp, 64*0), B0, Assembler::AVX_512bit); > > Could these vector spills be eliminated? I counted 8 spare zmm registers available across the vector loop (xmm7-xmm12, xmm30, xmm31). > > And here's what is explicitly used in `process256Loop`: > > D0 D1 = xmm2-xmm3 > B0 B1 B2 B3 B4 B5 = xmm19-xmm24 > TMP = xmm6 > A0 A1 A2 A3 A4 A5 = xmm13-xmm18 > R0 R1 R2 R1P R2P = xmm25-xmm29 > T0 T1 T2 T3 T4 T5 = xmm0-xmm5 Interesting!! Let me try that! ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 16 23:19:19 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 23:19:19 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: <9sCYEHe6Q8oPYHxAOWE4DjPGFalX0TIpaXkyWaGSyGk=.eb1e9867-8b83-4fed-a809-8c871cda8a23@github.com> References: <9sCYEHe6Q8oPYHxAOWE4DjPGFalX0TIpaXkyWaGSyGk=.eb1e9867-8b83-4fed-a809-8c871cda8a23@github.com> Message-ID: On Wed, 16 Nov 2022 23:08:16 GMT, Vladimir Ivanov wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 917: >> >>> 915: // Cleanup >>> 916: __ vpxorq(xmm0, xmm0, xmm0, Assembler::AVX_512bit); >>> 917: __ vpxorq(xmm1, xmm1, xmm1, Assembler::AVX_512bit); >> >> You could use T0, T1 in place of xmm0, xmm1 here. > > Or simply switch to `vzeroall` for `xmm0` - `xmm15`. ah.. I remember thinking about doing that.. `vzeroall` isnt encoded yet and I figured since I already have to do the xmm16-29, might as well do them all.. should I add that instruction too? 
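For reference, VZEROALL only clears the low bank xmm0-xmm15, so zmm16-zmm31 would still need explicit clearing either way. Assuming a vzeroall() wrapper is added to the assembler, the cleanup could shrink to roughly:

    __ vzeroall();                                  // zero xmm0-xmm15
    for (XMMRegister r = xmm16; r->is_valid(); r = r->successor()) {
      __ vpxorq(r, r, r, Assembler::AVX_512bit);    // zero the EVEX-only upper bank
    }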
------------- PR: https://git.openjdk.org/jdk/pull/10582 From vlivanov at openjdk.org Wed Nov 16 23:41:16 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 16 Nov 2022 23:41:16 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: References: <9sCYEHe6Q8oPYHxAOWE4DjPGFalX0TIpaXkyWaGSyGk=.eb1e9867-8b83-4fed-a809-8c871cda8a23@github.com> Message-ID: <2mAogemIauwWUYmfUtLgDNxtkskab6gjPrcFIQghuYk=.3f88c715-2d68-4995-9eaa-e989b6b5be8f@github.com> On Wed, 16 Nov 2022 23:14:45 GMT, Volodymyr Paprotski wrote: >> Or simply switch to `vzeroall` for `xmm0` - `xmm15`. > > ah.. I remember thinking about doing that.. `vzeroall` isnt encoded yet and I figured since I already have to do the xmm16-29, might as well do them all.. should I add that instruction too? Yes, please. And for the upper half of register file, just code it as a loop over register range: for (int rxmm_num = 16; rxmm_num < 30; rxmm_num++) { XMMRegister rxmm = as_XMMRegister(rxmm_num); __ vpxorq(rxmm, rxmm, rxmm, Assembler::AVX_512bit); } or even // Zeroes zmm16-zmm31. for (XMMRegister rxmm = xmm16; rxmm->is_valid(); rxmm = rxmm->successor()) { __ vpxorq(rxmm, rxmm, rxmm, Assembler::AVX_512bit); } ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Wed Nov 16 23:45:29 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 16 Nov 2022 23:45:29 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: <2mAogemIauwWUYmfUtLgDNxtkskab6gjPrcFIQghuYk=.3f88c715-2d68-4995-9eaa-e989b6b5be8f@github.com> References: <9sCYEHe6Q8oPYHxAOWE4DjPGFalX0TIpaXkyWaGSyGk=.eb1e9867-8b83-4fed-a809-8c871cda8a23@github.com> <2mAogemIauwWUYmfUtLgDNxtkskab6gjPrcFIQghuYk=.3f88c715-2d68-4995-9eaa-e989b6b5be8f@github.com> Message-ID: <8Wbchzsuzf9Vx7btDUoVEevuDmCdRCuKMlZZjBEyXmg=.8c6ef831-9f5c-4b32-b007-c0bca9161c9f@github.com> On Wed, 16 Nov 2022 23:39:00 GMT, Vladimir Ivanov wrote: >> ah.. I remember thinking about doing that.. `vzeroall` isnt encoded yet and I figured since I already have to do the xmm16-29, might as well do them all.. should I add that instruction too? > > Yes, please. And for the upper half of register file, just code it as a loop over register range: > > for (int rxmm_num = 16; rxmm_num < 30; rxmm_num++) { > XMMRegister rxmm = as_XMMRegister(rxmm_num); > __ vpxorq(rxmm, rxmm, rxmm, Assembler::AVX_512bit); > } > > or even > > // Zeroes zmm16-zmm31. > for (XMMRegister rxmm = xmm16; rxmm->is_valid(); rxmm = rxmm->successor()) { > __ vpxorq(rxmm, rxmm, rxmm, Assembler::AVX_512bit); > } Will do.. ("loop" erm.. wow.. "duh, this isn't assembler!") Thanks!! ------------- PR: https://git.openjdk.org/jdk/pull/10582 From fgao at openjdk.org Thu Nov 17 01:32:26 2022 From: fgao at openjdk.org (Fei Gao) Date: Thu, 17 Nov 2022 01:32:26 GMT Subject: RFR: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast [v2] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 06:04:09 GMT, Tobias Hartmann wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - Clean up related code >> - Merge branch 'master' into fg8295407 >> - 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast >> >> For unsupported `CMove` patterns, JDK-8293833 helps remove unused >> `CMove` and related packs from superword candidate packset by the >> function `remove_cmove_and_related_packs()`, but it only works >> when `-XX:+UseVectorCmov` is enabled[1]. When the option is not >> enabled, these unsupported `CMove` packs are still kept in the >> superword packset, causing the same failure. >> >> Actually, the function `filter_packs()` in superword is to filter >> out unsupported packs but it can't work as expected currently for >> these `CMove` cases. As we know, not all `CMove` packs can be >> vectorized. `merge_packs_to_cmovd()`[2] looks through all packs >> in the superword packset and generates a `CMove` candidate >> packset to collect all qualified `CMove` packs. Hence, only >> `CMove` packs in the `CMove` candidate packset are our target >> patterns and can be vectorized. But `filter_packs()` thinks, >> if the `CMove` pack is in a superword packset and its vector >> node is implemented in the current platform, then it can >> be vectorized. Therefore, the function doesn't remove >> these unsupported packs. >> >> We can adjust the function `implemented()` in the stage of >> `filter_packs()` to check if the current `CMove` pack is in >> the `CMove` candidate packset. If not, `filter_packs()` considers >> it not to be vectorized and then remove it. After the fix, >> whether `-XX:+UseVectorCmov` is enabled or not, these >> unsupported packs can be removed by `filter_packs()`. In this >> way, we don't need the function`remove_cmove_and_related_packs()` >> anymore and thus the patch also cleans related code. >> >> [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 >> [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 >> [3] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L2701 > > Looks good to me. Let's wait for Vladimir's testing to finish. Thanks for your kind review and test work, @TobiHartmann @vnkozlov. I'll integrate it. ------------- PR: https://git.openjdk.org/jdk/pull/11034 From fgao at openjdk.org Thu Nov 17 01:44:14 2022 From: fgao at openjdk.org (Fei Gao) Date: Thu, 17 Nov 2022 01:44:14 GMT Subject: Integrated: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast In-Reply-To: References: Message-ID: <5Er16BuhDwjQhPd3-0Tfx_u1ocQBSP3b1Z7NDvByJH4=.9cb8494b-2938-4a04-9c81-7f6c6d3358d6@github.com> On Tue, 8 Nov 2022 02:36:13 GMT, Fei Gao wrote: > For unsupported `CMove` patterns, [JDK-8293833](https://bugs.openjdk.org/browse/JDK-8295407) helps remove unused `CMove` and related packs from superword candidate packset by the function `remove_cmove_and_related_packs()`, but it only works when `-XX:+UseVectorCmov` is enabled[1]. When the option is not enabled, these unsupported `CMove` packs are still kept in the superword packset, causing the same failure. > > Actually, the function `filter_packs()` in superword is to filter out unsupported packs but it can't work as expected currently for these `CMove` cases. 
As we know, not all `CMove` packs can be vectorized. `merge_packs_to_cmovd()`[2] looks through all packs in the superword packset and generates a `CMove` candidate packset to collect all qualified `CMove` packs. Hence, only `CMove` packs in the `CMove` candidate packset are our target patterns and can be vectorized. But `filter_packs()` thinks, if the `CMove` pack is in a superword packset and its vector node is implemented in the current platform, then it can be vectorized. Therefore, the function doesn't remove these unsupported packs. > > We can adjust the function `implemented()` in the stage of `filter_packs()` to check if the current `CMove` pack is in the `CMove` candidate packset. If not, `filter_packs()` considers it not to be vectorized and then remove it. After the fix, whether > `-XX:+UseVectorCmov` is enabled or not, these unsupported packs can be removed by `filter_packs()`. In this way, we don't need the function`remove_cmove_and_related_packs()` anymore and thus the patch also cleans related code. > > [1] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L537 > [2] https://github.com/openjdk/jdk/blob/9b9be88bcaa35c89b6915ff0c251e5a04b10b330/src/hotspot/share/opto/superword.cpp#L1892 This pull request has now been integrated. Changeset: cc444198 Author: Fei Gao Committer: Pengfei Li URL: https://git.openjdk.org/jdk/commit/cc44419840d98fed0bcdab66bbb835855f1a8a11 Stats: 175 lines in 3 files changed: 74 ins; 45 del; 56 mod 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11034 From yyang at openjdk.org Thu Nov 17 02:44:28 2022 From: yyang at openjdk.org (Yi Yang) Date: Thu, 17 Nov 2022 02:44:28 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v3] In-Reply-To: References: <5ff-r2RgTNzao-sZ4D1kKWOPHWwzaCZxDDxyxl1Y0Us=.ae799d57-29ab-42c5-9908-a5811a8db0bc@github.com> Message-ID: On Wed, 16 Nov 2022 12:23:53 GMT, Tobias Hartmann wrote: >> src/hotspot/share/opto/loopnode.cpp line 3688: >> >>> 3686: if (incr2->in(1)->is_ConstraintCast() && !incr2->in(1)->in(0)->is_RangeCheck()) { >>> 3687: // Skip AddI->CastII->Phi case if CastII is not controlled by local RangeCheck >>> 3688: // to reflect changes in LibraryCallKit::inline_preconditions_checkIndex >> >> In the valid case, isn't the ConstraintCast control input `incr2->in(1)->in(0)` the IfTrue projection of the RangeCheck? >> >> I would remove the second line because it's not clear which "changes" in `LibraryCallKit::inline_preconditions_checkIndex` it is referring to. >> >> Suggestion: > > It's still not clear to me how the `incr2->in(1)->in(0)->is_RangeCheck()` condition can ever be true. How can the control input of a ConstraintCast be a RangeCheck? Shouldn't there be a projection node in-between? I missed this comment before... 
I need incr2->in(1)->in(0)->in(0)->is_RangeCheck() ------------- PR: https://git.openjdk.org/jdk/pull/9695 From yyang at openjdk.org Thu Nov 17 03:06:19 2022 From: yyang at openjdk.org (Yi Yang) Date: Thu, 17 Nov 2022 03:06:19 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v3] In-Reply-To: References: <5ff-r2RgTNzao-sZ4D1kKWOPHWwzaCZxDDxyxl1Y0Us=.ae799d57-29ab-42c5-9908-a5811a8db0bc@github.com> Message-ID: On Wed, 16 Nov 2022 12:25:21 GMT, Tobias Hartmann wrote: >> Yi Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> update comment > > test/hotspot/jtreg/compiler/c2/TestUnexpectedParallelIV.java line 28: > >> 26: * @test >> 27: * @bug 8290432 >> 28: * @summary Unexpected parallel induction variable pattern was recongized > > This test does not reproduce the issue for me, whereas [Test-2.java](https://bugs.openjdk.org/secure/attachment/100710/Test-2.java) still works. You can reset to JDK-8273585, TestUnexpectedParallelIV.java reproduces the crash. Test-2.java can always reproduce without resetting commits. I added both of them as test cases. ------------- PR: https://git.openjdk.org/jdk/pull/9695 From duke at openjdk.org Thu Nov 17 03:23:49 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 17 Nov 2022 03:23:49 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v21] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 
56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: vzeroall, no spill, reg re-map ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/dbdfd1dc..56aed9b1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=20 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=19-20 Stats: 182 lines in 3 files changed: 15 ins; 44 del; 123 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Thu Nov 17 03:23:49 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 17 Nov 2022 03:23:49 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v21] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 03:19:15 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > vzeroall, no spill, reg re-map @iwanowww Another round ready your way :) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Thu Nov 17 03:23:51 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 17 Nov 2022 03:23:51 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: References: <8UaOUsGNlGnh87OM1y8tMC6pVfeFn0nYUHyhvT7J-ss=.4e581e46-4759-4f69-8a6a-6383bc6f16de@github.com> Message-ID: <42Jq_3oM24kB-AcDEzAdsHIQwcOZX0y9_boTctpLUa4=.6076a16c-9cc6-4755-9eaf-1f6ca4c1fb85@github.com> On Wed, 16 Nov 2022 23:16:14 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 756: >> >>> 754: >>> 755: // Store R^8-R for later use >>> 756: __ evmovdquq(Address(rsp, 64*0), B0, Assembler::AVX_512bit); >> >> Could these vector spills be eliminated? 
I counted 8 spare zmm registers available across the vector loop (xmm7-xmm12, xmm30, xmm31). >> >> And here's what is explicitly used in `process256Loop`: >> >> D0 D1 = xmm2-xmm3 >> B0 B1 B2 B3 B4 B5 = xmm19-xmm24 >> TMP = xmm6 >> A0 A1 A2 A3 A4 A5 = xmm13-xmm18 >> R0 R1 R2 R1P R2P = xmm25-xmm29 >> T0 T1 T2 T3 T4 T5 = xmm0-xmm5 > > Interesting!! Let me try that! Done! PS: This find really was great! PPS: I also reordered the map alphabetically and counted in-order... it was just really bugging me!! ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Thu Nov 17 03:23:52 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 17 Nov 2022 03:23:52 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v20] In-Reply-To: <8Wbchzsuzf9Vx7btDUoVEevuDmCdRCuKMlZZjBEyXmg=.8c6ef831-9f5c-4b32-b007-c0bca9161c9f@github.com> References: <9sCYEHe6Q8oPYHxAOWE4DjPGFalX0TIpaXkyWaGSyGk=.eb1e9867-8b83-4fed-a809-8c871cda8a23@github.com> <2mAogemIauwWUYmfUtLgDNxtkskab6gjPrcFIQghuYk=.3f88c715-2d68-4995-9eaa-e989b6b5be8f@github.com> <8Wbchzsuzf9Vx7btDUoVEevuDmCdRCuKMlZZjBEyXmg=.8c6ef831-9f5c-4b32-b007-c0bca9161c9f@github.com> Message-ID: On Wed, 16 Nov 2022 23:41:32 GMT, Volodymyr Paprotski wrote: >> Yes, please. And for the upper half of register file, just code it as a loop over register range: >> >> for (int rxmm_num = 16; rxmm_num < 30; rxmm_num++) { >> XMMRegister rxmm = as_XMMRegister(rxmm_num); >> __ vpxorq(rxmm, rxmm, rxmm, Assembler::AVX_512bit); >> } >> >> or even >> >> // Zeroes zmm16-zmm31. >> for (XMMRegister rxmm = xmm16; rxmm->is_valid(); rxmm = rxmm->successor()) { >> __ vpxorq(rxmm, rxmm, rxmm, Assembler::AVX_512bit); >> } > > Will do.. ("loop" erm.. wow.. "duh, this isn't assembler!") Thanks!! done (Note: disassembler proof for vzeroall encoding 0x7fffed0022f8: vzeroall 0x7fffed0022fb: vpxorq zmm16,zmm16,zmm16 0x7fffed002301: vpxorq zmm17,zmm17,zmm17 0x7fffed002307: vpxorq zmm18,zmm18,zmm18 0x7fffed00230d: vpxorq zmm19,zmm19,zmm19 0x7fffed002313: vpxorq zmm20,zmm20,zmm20 0x7fffed002319: vpxorq zmm21,zmm21,zmm21 0x7fffed00231f: vpxorq zmm22,zmm22,zmm22 0x7fffed002325: vpxorq zmm23,zmm23,zmm23 0x7fffed00232b: vpxorq zmm24,zmm24,zmm24 0x7fffed002331: vpxorq zmm25,zmm25,zmm25 0x7fffed002337: vpxorq zmm26,zmm26,zmm26 0x7fffed00233d: vpxorq zmm27,zmm27,zmm27 0x7fffed002343: vpxorq zmm28,zmm28,zmm28 0x7fffed002349: vpxorq zmm29,zmm29,zmm29 0x7fffed00234f: vpxorq zmm30,zmm30,zmm30 0x7fffed002355: vpxorq zmm31,zmm31,zmm31 0x7fffed00235b: cmp ebx,0x10 0x7fffed00235e: jl 0x7fffed0023e6 ) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From yyang at openjdk.org Thu Nov 17 03:26:15 2022 From: yyang at openjdk.org (Yi Yang) Date: Thu, 17 Nov 2022 03:26:15 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v4] In-Reply-To: References: Message-ID: > Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. 
In the following program: > > class Test { > static int dontInline() { > return 0; > } > > static long test(int val, boolean b) { > long ret = 0; > long dArr[] = new long[100]; > for (int i = 15; 293 > i; ++i) { > ret = val; > int j = 1; > while (++j < 6) { > int k = (val--); > for (long l = i; 1 > l; ) { > if (k != 0) { > ret += dontInline(); > } > } > if (b) { > break; > } > } > } > return ret; > } > > public static void main(String[] args) { > for (int i = 0; i < 1000; i++) { > test(0, false); > } > } > } > > `val` is incorrectly matched with the new parallel IV form: > ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) > And C2 further replaces it with newly added nodes, which finally leads the crash: > ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) > > I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) Yi Yang has updated the pull request incrementally with one additional commit since the last revision: skip IfProj ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9695/files - new: https://git.openjdk.org/jdk/pull/9695/files/7c69d8fb..4cce45e5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9695&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9695&range=02-03 Stats: 68 lines in 2 files changed: 66 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9695.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9695/head:pull/9695 PR: https://git.openjdk.org/jdk/pull/9695 From thartmann at openjdk.org Thu Nov 17 06:01:29 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 17 Nov 2022 06:01:29 GMT Subject: Integrated: 8276064: CheckCastPP with raw oop input floats below a safepoint In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 11:55:04 GMT, Tobias Hartmann wrote: > This bug is similar to [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600): A CheckCastPP with a raw oop input floats out of a loop and below a safepoint. Since C2 does not generate OopMap entries for raw pointers, the GC will not update the oop if the corresponding object is moved during the safepoint. We either assert already during OopMap creation, or crash when dereferencing a stale oop during runtime (the verification code does not always detect such live raw oops at safepoints, I included a fix for that as well). > > I think the fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) is incomplete, because it only bails out of [PhaseIdealLoop::try_sink_out_of_loop](https://github.com/openjdk/jdk/commit/2ff4c01d42f1afcc53abd48e074356fb4a700754) while the underlying issue is that a raw CheckCastPP ends up with ctrl "far away" from its Allocate/Initialize and potentially even below a safepoint. Usually, the CheckCastPP would always be part of safepoint debug info and therefore late ctrl would be guaranteed to be above the safepoint. However, vector objects are aggressively scalar replaced in safepoints, which allows late ctrl to be set to further below. This is specific to vectors, since "normal" Java objects would either be fully scalarized or not be scalarized at all. 
> > In the failing case, Loop Unswitching clones the loop body and creates a Phi to merge the oop results from the vector allocations in both loops. Since ctrl of the CheckCastPP is outside of the loop, its data input is changed to the newly created Phi and its control input is set to the region that merges the loop exits. This moves the CheckCastPP below a safepoint in the loop. > > Below graphs show the details. `395 CheckCastPP` is removed from the debug info for `262 CallStaticJava` because it's scalarized (`326 SafepointScalarObject`). Late ctrl is then computed to be outside of the loop because the CheckCastPP is only used in the return. > > ![8276064_Before](https://user-images.githubusercontent.com/5312595/199204249-17564a59-2b67-4426-be71-19bc0eafac99.png) > > Now Loop Unswitching creates a `487 Region` and `517 Phi` to merge control and data inputs to the CheckCastPP from the fast and slow loops (see `PhaseIdealLoop::clone_loop_handle_data_uses`). Control of the `395 CheckCastPP` is updated accordingly. > > ![8276064_After](https://user-images.githubusercontent.com/5312595/199204273-44341cd7-b5b6-4ec0-b8c9-6f349393dbd1.png) > > As a result, the raw oop input of the `395 CheckCastPP` is live at `262 CallStaticJava`. > > We could now add another point fix to prevent loop unswitching from moving the CheckCastPP out of the loop, but I think there is a risk that other current or future optimizations would rely on the CheckCastPP's late ctrl and do a similar thing. I would therefore suggest to pin all CheckCastPPs with a raw oop input, similar to what [JDK-5071820](https://bugs.openjdk.org/browse/JDK-5071820) did in GCM: > https://github.com/openjdk/jdk/blob/37107fc1574a4191987420d88f7182e63c7da60c/src/hotspot/share/opto/gcm.cpp#L1325-L1330 > > The fix for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600) can then be reverted, also because Roland's fix for [JDK-8272562](https://bugs.openjdk.org/browse/JDK-8272562) disabled moving **all** CheckCastPPs out of loops anyway. Roland said that he plans to revisit that decision with [JDK-8275202](https://bugs.openjdk.org/browse/JDK-8275202). The tests added with this PR will cover the `PhaseIdealLoop::try_sink_out_of_loop` case as well and therefore serve as regression tests for [JDK-8271600](https://bugs.openjdk.org/browse/JDK-8271600). > > We could improve this by adding logic to set late ctrl just above the safepoint, but I'm not sure if it's worth the complexity because we would need to walk up the control paths from late to early control and compute the dominator of all safepoints. > > I also fixed the verification code in `OopFlow::build_oop_map` to account for spilling. Before, compilation of `test1` would pass and only crash during execution. Now, we assert and print: > > > 454 DefinitionSpillCopy === _ 122 [[ 321 ]] !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 321 Phi === 315 454 503 [[ 512 ]] #rawptr:NotNull !jvms: IntVector::zero @ bci:26 (line 563) TestRawOopAtSafepoint::test1 @ bci:25 (line 74) > 38 CallStaticJavaDirect === 40 126 136 102 0 455 452 138 139 453 460 [[ 39 84 130 37 388 ]] Static compiler.vectorapi.TestRawOopAtSafepoint::safepoint # void ( int ) TestRawOopAtSafepoint::test1 @ bci:44 (line 75) !jvms: TestRawOopAtSafepoint::test1 @ bci:44 (line 75) > > > > What do you think? > > Thanks, > Tobias This pull request has now been integrated. 
Changeset: cd9c688b Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/cd9c688bfce36e4b2d37dd68dd8031f197b9eddc Stats: 154 lines in 4 files changed: 146 ins; 4 del; 4 mod 8276064: CheckCastPP with raw oop input floats below a safepoint Reviewed-by: kvn, vlivanov, roland ------------- PR: https://git.openjdk.org/jdk/pull/10932 From thartmann at openjdk.org Thu Nov 17 06:07:33 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 17 Nov 2022 06:07:33 GMT Subject: RFR: 8296912: C2: CreateExNode::Identity fails with assert(i < _max) failed: oob: i=1, _max=1 In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 12:00:34 GMT, Tobias Hartmann wrote: > The fix for [JDK-8284358](https://bugs.openjdk.org/browse/JDK-8284358) added code that aggressively removes dead subgraphs when detecting an unreachable Region by walking up the CFG and replacing all nodes by top (because they must be unreachable as well). In this case, we detect that `280 Region` is unreachable from root and replace `276 Catch` by top while walking up the CFG: > > ![Screenshot from 2022-11-16 12-53-56](https://user-images.githubusercontent.com/5312595/202174104-9437914a-cb38-401c-b0fd-d6ca849969b0.png) > > Code in `CreateExNode::Identity` does not expect `292 CatchProj` to have a top input when processing `305 CreateEx`. The fix is to simply add a `in(0)->in(0)->is_Catch()` check. > > Thanks, > Tobias Thanks for the review, Vladimir! ------------- PR: https://git.openjdk.org/jdk/pull/11181 From haosun at openjdk.org Thu Nov 17 07:23:50 2022 From: haosun at openjdk.org (Hao Sun) Date: Thu, 17 Nov 2022 07:23:50 GMT Subject: RFR: 8293856: AArch64: Remove clear_inst_mark from aarch64_enc_java_dynamic_call Message-ID: 1) After the fix of JDK-8287394, there is no need for clear_inst_mark after trampoline_call. See the discussion in [1]. 2) MacroAssembler::ic_call has trampoline_call as the last call. Hence, clear_inst_mark after MacroAssembler::ic_call can be removed. There is such a case in aarch64_enc_java_dynamic_call. We conduct the cleanup in this patch. Testing: tier1~3 passed with no new failures on Linux/AArch64 platform. [1] https://github.com/openjdk/jdk/pull/8564#discussion_r871062342 ------------- Commit messages: - 8293856: AArch64: Remove clear_inst_mark from aarch64_enc_java_dynamic_call Changes: https://git.openjdk.org/jdk/pull/11200/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11200&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8293856 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11200.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11200/head:pull/11200 PR: https://git.openjdk.org/jdk/pull/11200 From fyang at openjdk.org Thu Nov 17 07:34:31 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 17 Nov 2022 07:34:31 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v2] In-Reply-To: References: Message-ID: <0sXo2P1_NlZ-N1wGd-C5KzZjsC0WVsw8p2lqCUeK5Jo=.d71d74ea-8932-4745-bfe6-d8c8bdd375ac@github.com> On Wed, 9 Nov 2022 11:34:43 GMT, Vladimir Kempik wrote: >> Please review this change to improve the performance of copy_memory stub on risc-v >> >> This change has three parts >> 1) use copy32 if possible to do 4 ld and 4 st per loop cycle >> 2) don't produce precopy code if is_aligned is true, it's not executed. 
>> 3) in the end of loop8 and loop32, remove data dependency between two addi opcodes, to allow them to be scheduled simultaneously >> >> testing: org.openjdk.bench.vm.compiler.ArrayCopyObject, hotspot_compiler_arraycopy, hotspot:tier1, hotspot:tier2 - all ok >> hotspot:tier2 is on the way. >> >> and for the benchmark results, using >> org.openjdk.bench.vm.compiler.ArrayCopyObject.conjoint_micro >> >> thead rvb-ice c910 >> thead >> >> Before ( copy8 only ) >> Benchmark (size) Mode Cnt Score Error Units >> ArrayCopyObject.conjoint_micro 31 thrpt 25 6653.095 ? 251.565 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 4933.970 ? 77.559 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 3627.454 ? 34.589 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 368.249 ? 0.453 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 187.776 ? 0.306 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 94.477 ? 0.340 ops/ms >> >> after ( with copy32 ) >> ArrayCopyObject.conjoint_micro 31 thrpt 25 7620.546 ? 69.756 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 6677.978 ? 33.112 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 5206.973 ? 22.612 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 653.655 ? 31.494 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 352.905 ? 7.390 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 165.127 ? 0.832 ops/ms >> >> after ( copy32 with dead code elimination and independent addis ) >> ArrayCopyObject.conjoint_micro 31 thrpt 25 7576.346 ? 94.487 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 6475.730 ? 252.590 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 5221.764 ? 20.415 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 691.847 ? 1.102 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 360.269 ? 1.091 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 179.733 ? 3.012 ops/ms >> >> on hifive unmatched: >> >> before: >> Benchmark (size) Mode Cnt Score Error Units >> ArrayCopyObject.conjoint_micro 31 thrpt 25 5391.575 ? 152.984 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 3700.946 ? 43.175 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 2316.160 ? 24.734 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 188.616 ? 0.151 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 95.323 ? 0.053 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 46.935 ? 0.041 ops/ms >> >> after: >> Benchmark (size) Mode Cnt Score Error Units >> ArrayCopyObject.conjoint_micro 31 thrpt 25 6136.169 ? 330.409 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 4924.020 ? 78.529 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 3732.561 ? 89.606 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 431.103 ? 0.505 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 221.543 ? 0.363 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 100.586 ? 0.197 ops/ms > > Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: > > remove excessive comments Nice numbers! Overall looks good to me. Several minor nits. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 902: > 900: * if ((dst % 8) == (src % 8)) { > 901: * aligned; > 902: * goto copy_big; You might want to update code comment at line 884 too. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1042: > 1040: __ bind(copy32); > 1041: if (is_backwards) { > 1042: __ addi(src, src, -wordSize*4); Can you leave a space before and after the operator here and other places? 
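For orientation, the shape of the new copy32 fast path under review is roughly as follows (sketch only; register names are illustrative, and the entry guard, pre-alignment and backwards variant are omitted):

    Label copy32_loop;
    __ bind(copy32_loop);
    // 4 loads + 4 stores per iteration, i.e. 32 bytes per round on RV64
    __ ld(tmp3, Address(src, 0));
    __ ld(tmp4, Address(src, wordSize));
    __ ld(tmp5, Address(src, 2 * wordSize));
    __ ld(tmp6, Address(src, 3 * wordSize));
    __ sd(tmp3, Address(dst, 0));
    __ sd(tmp4, Address(dst, wordSize));
    __ sd(tmp5, Address(dst, 2 * wordSize));
    __ sd(tmp6, Address(dst, 3 * wordSize));
    __ addi(src, src, 4 * wordSize);
    __ addi(dst, dst, 4 * wordSize);
    __ addi(tmp, cnt, -(32 + 4 * wordSize));  // loop again only if >= 32 bytes remain after this round
    __ addi(cnt, cnt, -4 * wordSize);
    __ bgez(tmp, copy32_loop);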
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1059: > 1057: __ addi(dst, dst, wordSize*4); > 1058: } > 1059: __ addi(tmp4, cnt, -(32+wordSize*4)); Can we use 'tmp' instead of 'tmp4' here? Then it will be consistent in register usage with other places where we check 'cnt'. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1065: > 1063: __ beqz(cnt, done); // if that's all - done > 1064: > 1065: __ addi(tmp4, cnt, -8); // if not - copy the reminder Similar here. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1079: > 1077: __ addi(dst, dst, wordSize); > 1078: } > 1079: __ addi(tmp4, cnt, -(8+wordSize)); Similar here. ------------- PR: https://git.openjdk.org/jdk/pull/11058 From thartmann at openjdk.org Thu Nov 17 07:43:11 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 17 Nov 2022 07:43:11 GMT Subject: Integrated: 8296912: C2: CreateExNode::Identity fails with assert(i < _max) failed: oob: i=1, _max=1 In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 12:00:34 GMT, Tobias Hartmann wrote: > The fix for [JDK-8284358](https://bugs.openjdk.org/browse/JDK-8284358) added code that aggressively removes dead subgraphs when detecting an unreachable Region by walking up the CFG and replacing all nodes by top (because they must be unreachable as well). In this case, we detect that `280 Region` is unreachable from root and replace `276 Catch` by top while walking up the CFG: > > ![Screenshot from 2022-11-16 12-53-56](https://user-images.githubusercontent.com/5312595/202174104-9437914a-cb38-401c-b0fd-d6ca849969b0.png) > > Code in `CreateExNode::Identity` does not expect `292 CatchProj` to have a top input when processing `305 CreateEx`. The fix is to simply add a `in(0)->in(0)->is_Catch()` check. > > Thanks, > Tobias This pull request has now been integrated. Changeset: 502fa3ee Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/502fa3eeea849cfcc50436602be1654695ef4e26 Stats: 30 lines in 2 files changed: 23 ins; 1 del; 6 mod 8296912: C2: CreateExNode::Identity fails with assert(i < _max) failed: oob: i=1, _max=1 Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11181 From vkempik at openjdk.org Thu Nov 17 08:08:22 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 17 Nov 2022 08:08:22 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v2] In-Reply-To: <0sXo2P1_NlZ-N1wGd-C5KzZjsC0WVsw8p2lqCUeK5Jo=.d71d74ea-8932-4745-bfe6-d8c8bdd375ac@github.com> References: <0sXo2P1_NlZ-N1wGd-C5KzZjsC0WVsw8p2lqCUeK5Jo=.d71d74ea-8932-4745-bfe6-d8c8bdd375ac@github.com> Message-ID: <1G-_KbLzZKnyecKCBUKJCoOCs5Z6s3Zew-Hd9JeWJL0=.6c82b000-5e5f-46d7-9a68-e5678d9ab2b0@github.com> On Thu, 17 Nov 2022 07:31:08 GMT, Fei Yang wrote: >> Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: >> >> remove excessive comments > > src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1042: > >> 1040: __ bind(copy32); >> 1041: if (is_backwards) { >> 1042: __ addi(src, src, -wordSize*4); > > Can you leave a space before and after the operator here and other places? Not sure I'm getting you right here. 
Please elaborate ------------- PR: https://git.openjdk.org/jdk/pull/11058 From fyang at openjdk.org Thu Nov 17 08:21:40 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 17 Nov 2022 08:21:40 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v2] In-Reply-To: <1G-_KbLzZKnyecKCBUKJCoOCs5Z6s3Zew-Hd9JeWJL0=.6c82b000-5e5f-46d7-9a68-e5678d9ab2b0@github.com> References: <0sXo2P1_NlZ-N1wGd-C5KzZjsC0WVsw8p2lqCUeK5Jo=.d71d74ea-8932-4745-bfe6-d8c8bdd375ac@github.com> <1G-_KbLzZKnyecKCBUKJCoOCs5Z6s3Zew-Hd9JeWJL0=.6c82b000-5e5f-46d7-9a68-e5678d9ab2b0@github.com> Message-ID: <2muRxIfLubi5p1Lm4vDAjNHXNZHRUWE2jpOsaYRD3So=.bc51c6ec-fc66-47b2-ac88-63f3a3cfa87f@github.com> On Thu, 17 Nov 2022 08:05:59 GMT, Vladimir Kempik wrote: >> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1042: >> >>> 1040: __ bind(copy32); >>> 1041: if (is_backwards) { >>> 1042: __ addi(src, src, -wordSize*4); >> >> Can you leave a space before and after the operator here and other places? > > Not sure I'm getting you right here. Please elaborate I am suggesting this style: __ addi(src, src, -wordSize * 4); instead of? __ addi(src, src, -wordSize*4); ------------- PR: https://git.openjdk.org/jdk/pull/11058 From vkempik at openjdk.org Thu Nov 17 08:21:41 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 17 Nov 2022 08:21:41 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v2] In-Reply-To: <0sXo2P1_NlZ-N1wGd-C5KzZjsC0WVsw8p2lqCUeK5Jo=.d71d74ea-8932-4745-bfe6-d8c8bdd375ac@github.com> References: <0sXo2P1_NlZ-N1wGd-C5KzZjsC0WVsw8p2lqCUeK5Jo=.d71d74ea-8932-4745-bfe6-d8c8bdd375ac@github.com> Message-ID: <25V66dP5_W10bgGZJbst-qGopmuQrsz4zku0km5PcQI=.a6d8b6af-ef44-46f9-af8d-27b99a3b4822@github.com> On Thu, 17 Nov 2022 07:28:51 GMT, Fei Yang wrote: >> Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: >> >> remove excessive comments > > src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1059: > >> 1057: __ addi(dst, dst, wordSize*4); >> 1058: } >> 1059: __ addi(tmp4, cnt, -(32+wordSize*4)); > > Can we use 'tmp' instead of 'tmp4' here? Then it will be consistent in register usage with other places where we check 'cnt'. done, here and few other places ------------- PR: https://git.openjdk.org/jdk/pull/11058 From vkempik at openjdk.org Thu Nov 17 08:21:40 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 17 Nov 2022 08:21:40 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v3] In-Reply-To: References: Message-ID: > Please review this change to improve the performance of copy_memory stub on risc-v > > This change has three parts > 1) use copy32 if possible to do 4 ld and 4 st per loop cycle > 2) don't produce precopy code if is_aligned is true, it's not executed. > 3) in the end of loop8 and loop32, remove data dependency between two addi opcodes, to allow them to be scheduled simultaneously > > testing: org.openjdk.bench.vm.compiler.ArrayCopyObject, hotspot_compiler_arraycopy, hotspot:tier1, hotspot:tier2 - all ok > hotspot:tier2 is on the way. > > and for the benchmark results, using > org.openjdk.bench.vm.compiler.ArrayCopyObject.conjoint_micro > > thead rvb-ice c910 > thead > > Before ( copy8 only ) > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6653.095 ? 251.565 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4933.970 ? 77.559 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3627.454 ? 
34.589 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 368.249 ? 0.453 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 187.776 ? 0.306 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 94.477 ? 0.340 ops/ms > > after ( with copy32 ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7620.546 ? 69.756 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6677.978 ? 33.112 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5206.973 ? 22.612 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 653.655 ? 31.494 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 352.905 ? 7.390 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 165.127 ? 0.832 ops/ms > > after ( copy32 with dead code elimination and independent addis ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7576.346 ? 94.487 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6475.730 ? 252.590 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5221.764 ? 20.415 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 691.847 ? 1.102 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 360.269 ? 1.091 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 179.733 ? 3.012 ops/ms > > on hifive unmatched: > > before: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 5391.575 ? 152.984 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 3700.946 ? 43.175 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 2316.160 ? 24.734 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 188.616 ? 0.151 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 95.323 ? 0.053 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 46.935 ? 0.041 ops/ms > > after: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6136.169 ? 330.409 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4924.020 ? 78.529 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3732.561 ? 89.606 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 431.103 ? 0.505 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 221.543 ? 0.363 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 100.586 ? 
0.197 ops/ms Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: Update 1 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11058/files - new: https://git.openjdk.org/jdk/pull/11058/files/a788f8f2..8d2a5a25 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11058&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11058&range=01-02 Stats: 11 lines in 1 file changed: 4 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/11058.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11058/head:pull/11058 PR: https://git.openjdk.org/jdk/pull/11058 From vkempik at openjdk.org Thu Nov 17 08:26:19 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 17 Nov 2022 08:26:19 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v2] In-Reply-To: <2muRxIfLubi5p1Lm4vDAjNHXNZHRUWE2jpOsaYRD3So=.bc51c6ec-fc66-47b2-ac88-63f3a3cfa87f@github.com> References: <0sXo2P1_NlZ-N1wGd-C5KzZjsC0WVsw8p2lqCUeK5Jo=.d71d74ea-8932-4745-bfe6-d8c8bdd375ac@github.com> <1G-_KbLzZKnyecKCBUKJCoOCs5Z6s3Zew-Hd9JeWJL0=.6c82b000-5e5f-46d7-9a68-e5678d9ab2b0@github.com> <2muRxIfLubi5p1Lm4vDAjNHXNZHRUWE2jpOsaYRD3So=.bc51c6ec-fc66-47b2-ac88-63f3a3cfa87f@github.com> Message-ID: <2rSJcf0EdcpSvwRzeQs7J1WCoecqwibiKpD76BXeE_8=.a22c3eec-7932-4967-9357-75547fde8a45@github.com> On Thu, 17 Nov 2022 08:16:12 GMT, Fei Yang wrote: >> Not sure I'm getting you right here. Please elaborate > > I am suggesting this style: > > __ addi(src, src, -wordSize * 4); > > instead of? > > __ addi(src, src, -wordSize*4); Thank you, done. ------------- PR: https://git.openjdk.org/jdk/pull/11058 From rkennke at openjdk.org Thu Nov 17 08:26:59 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 17 Nov 2022 08:26:59 GMT Subject: Integrated: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() In-Reply-To: References: Message-ID: On Tue, 1 Nov 2022 15:37:16 GMT, Roman Kennke wrote: > The code in C2_MacroAssembler::fast_unlock() has several (minor) issues: > - The stack-locking path for x86_32 is not under UseHeavyMonitors - it would be executed even when stack-locking is disabled. > - The stack-locking paths are the same for x86_32 and x86_64 - they can be merged into a common path. > - In x86_32 path, we call get_thread(boxReg) which is totally bogus because we clear boxReg right afterwards with xorptr(boxReg, boxReg). > - In x86_32 path, the CheckSucc label is identical to the DONE label, and in-fact CheckSucc is only ever really used in the x86_64 path and can be moved there. > > Testing: > - [x] tier1 (x86_32, x86_64) > - [x] tier2 (x86_32, x86_64) This pull request has now been integrated. Changeset: e81359f1 Author: Roman Kennke URL: https://git.openjdk.org/jdk/commit/e81359f14802ef520ad4dbb01202a74313c9dc7f Stats: 31 lines in 1 file changed: 2 ins; 26 del; 3 mod 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() Reviewed-by: thartmann, phh ------------- PR: https://git.openjdk.org/jdk/pull/10936 From vkempik at openjdk.org Thu Nov 17 08:26:18 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 17 Nov 2022 08:26:18 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v4] In-Reply-To: References: Message-ID: > Please review this change to improve the performance of copy_memory stub on risc-v > > This change has three parts > 1) use copy32 if possible to do 4 ld and 4 st per loop cycle > 2) don't produce precopy code if is_aligned is true, it's not executed. 
> 3) in the end of loop8 and loop32, remove data dependency between two addi opcodes, to allow them to be scheduled simultaneously > > testing: org.openjdk.bench.vm.compiler.ArrayCopyObject, hotspot_compiler_arraycopy, hotspot:tier1, hotspot:tier2 - all ok > hotspot:tier2 is on the way. > > and for the benchmark results, using > org.openjdk.bench.vm.compiler.ArrayCopyObject.conjoint_micro > > thead rvb-ice c910 > thead > > Before ( copy8 only ) > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6653.095 ? 251.565 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4933.970 ? 77.559 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3627.454 ? 34.589 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 368.249 ? 0.453 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 187.776 ? 0.306 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 94.477 ? 0.340 ops/ms > > after ( with copy32 ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7620.546 ? 69.756 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6677.978 ? 33.112 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5206.973 ? 22.612 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 653.655 ? 31.494 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 352.905 ? 7.390 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 165.127 ? 0.832 ops/ms > > after ( copy32 with dead code elimination and independent addis ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7576.346 ? 94.487 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6475.730 ? 252.590 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5221.764 ? 20.415 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 691.847 ? 1.102 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 360.269 ? 1.091 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 179.733 ? 3.012 ops/ms > > on hifive unmatched: > > before: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 5391.575 ? 152.984 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 3700.946 ? 43.175 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 2316.160 ? 24.734 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 188.616 ? 0.151 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 95.323 ? 0.053 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 46.935 ? 0.041 ops/ms > > after: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6136.169 ? 330.409 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4924.020 ? 78.529 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3732.561 ? 89.606 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 431.103 ? 0.505 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 221.543 ? 0.363 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 100.586 ? 
0.197 ops/ms Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: Update formatting ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11058/files - new: https://git.openjdk.org/jdk/pull/11058/files/8d2a5a25..cc91f7b6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11058&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11058&range=02-03 Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/11058.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11058/head:pull/11058 PR: https://git.openjdk.org/jdk/pull/11058 From rkennke at openjdk.org Thu Nov 17 08:25:17 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 17 Nov 2022 08:25:17 GMT Subject: RFR: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() [v4] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 18:19:19 GMT, Roman Kennke wrote: >> The code in C2_MacroAssembler::fast_unlock() has several (minor) issues: >> - The stack-locking path for x86_32 is not under UseHeavyMonitors - it would be executed even when stack-locking is disabled. >> - The stack-locking paths are the same for x86_32 and x86_64 - they can be merged into a common path. >> - In x86_32 path, we call get_thread(boxReg) which is totally bogus because we clear boxReg right afterwards with xorptr(boxReg, boxReg). >> - In x86_32 path, the CheckSucc label is identical to the DONE label, and in-fact CheckSucc is only ever really used in the x86_64 path and can be moved there. >> >> Testing: >> - [x] tier1 (x86_32, x86_64) >> - [x] tier2 (x86_32, x86_64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8296170 > - Remove comments about DONE_LABEL being a hot target > - Merge remote-tracking branch 'upstream/master' into JDK-8296170 > - 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock() Thanks for reviewing! GHA is green now, let's ------------- PR: https://git.openjdk.org/jdk/pull/10936 From fyang at openjdk.org Thu Nov 17 08:57:21 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 17 Nov 2022 08:57:21 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v4] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 08:26:18 GMT, Vladimir Kempik wrote: >> Please review this change to improve the performance of copy_memory stub on risc-v >> >> This change has three parts >> 1) use copy32 if possible to do 4 ld and 4 st per loop cycle >> 2) don't produce precopy code if is_aligned is true, it's not executed. >> 3) in the end of loop8 and loop32, remove data dependency between two addi opcodes, to allow them to be scheduled simultaneously >> >> testing: org.openjdk.bench.vm.compiler.ArrayCopyObject, hotspot_compiler_arraycopy, hotspot:tier1, hotspot:tier2 - all ok >> hotspot:tier2 is on the way. >> >> and for the benchmark results, using >> org.openjdk.bench.vm.compiler.ArrayCopyObject.conjoint_micro >> >> thead rvb-ice c910 >> thead >> >> Before ( copy8 only ) >> Benchmark (size) Mode Cnt Score Error Units >> ArrayCopyObject.conjoint_micro 31 thrpt 25 6653.095 ? 251.565 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 4933.970 ? 77.559 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 3627.454 ? 
34.589 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 368.249 ? 0.453 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 187.776 ? 0.306 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 94.477 ? 0.340 ops/ms >> >> after ( with copy32 ) >> ArrayCopyObject.conjoint_micro 31 thrpt 25 7620.546 ? 69.756 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 6677.978 ? 33.112 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 5206.973 ? 22.612 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 653.655 ? 31.494 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 352.905 ? 7.390 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 165.127 ? 0.832 ops/ms >> >> after ( copy32 with dead code elimination and independent addis ) >> ArrayCopyObject.conjoint_micro 31 thrpt 25 7576.346 ? 94.487 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 6475.730 ? 252.590 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 5221.764 ? 20.415 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 691.847 ? 1.102 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 360.269 ? 1.091 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 179.733 ? 3.012 ops/ms >> >> on hifive unmatched: >> >> before: >> Benchmark (size) Mode Cnt Score Error Units >> ArrayCopyObject.conjoint_micro 31 thrpt 25 5391.575 ? 152.984 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 3700.946 ? 43.175 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 2316.160 ? 24.734 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 188.616 ? 0.151 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 95.323 ? 0.053 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 46.935 ? 0.041 ops/ms >> >> after: >> Benchmark (size) Mode Cnt Score Error Units >> ArrayCopyObject.conjoint_micro 31 thrpt 25 6136.169 ? 330.409 ops/ms >> ArrayCopyObject.conjoint_micro 63 thrpt 25 4924.020 ? 78.529 ops/ms >> ArrayCopyObject.conjoint_micro 127 thrpt 25 3732.561 ? 89.606 ops/ms >> ArrayCopyObject.conjoint_micro 2047 thrpt 25 431.103 ? 0.505 ops/ms >> ArrayCopyObject.conjoint_micro 4095 thrpt 25 221.543 ? 0.363 ops/ms >> ArrayCopyObject.conjoint_micro 8191 thrpt 25 100.586 ? 0.197 ops/ms > > Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: > > Update formatting Updated change looks good. Thanks. PS: I think it might be better to rename 'copy8' and 'copy32' into 'copy8_loop' and 'copy32_loop' respectively. But it's up to you. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/11058 From dongbo at openjdk.org Thu Nov 17 09:10:28 2022 From: dongbo at openjdk.org (Dong Bo) Date: Thu, 17 Nov 2022 09:10:28 GMT Subject: Integrated: 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 03:06:21 GMT, Dong Bo wrote: > In JDK-8252204, when implemented SHA3 intrinsics, we use `digest_length` to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate `block_size` with `block_size = 200 - 2 * digest_length`. > However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary `digest_length`: > > digest_length block_size > SHA3-224 28 144 > SHA3-256 32 136 > SHA3-384 48 104 > SHA3-512 64 72 > SHAKE128 variable 168 > SHAKE256 variable 136 > > > This causes SIGSEGV crash or hash code mismatch with `test/jdk/sun/security/ec/ed/EdDSATest.java`. The test calls `SHAKE256` in `Ed448`. 
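To put concrete numbers on the mismatch described above: the fixed-length SHA3 variants have a block (rate) size that can be derived from the digest length, but the SHAKE instances have a fixed rate and a caller-chosen output length. The stand-alone Java sketch below is illustrative only and is not the JDK or HotSpot code; the 114-byte output length is the amount Ed448 requests from SHAKE256 and is used here purely as an example.

public class Sha3BlockSizeSketch {
    // Only valid for the fixed-length instances SHA3-224/256/384/512.
    static int blockSizeFromDigestLength(int digestLength) {
        return 200 - 2 * digestLength;
    }

    public static void main(String[] args) {
        System.out.println("SHA3-256: " + blockSizeFromDigestLength(32));  // 136, correct
        System.out.println("SHA3-512: " + blockSizeFromDigestLength(64));  // 72, correct
        // SHAKE256 keeps a fixed 136-byte rate no matter how much output is
        // requested (e.g. 114 bytes in Ed448), so deriving the rate from the
        // requested output length produces nonsense:
        System.out.println("SHAKE256, 114-byte output: " + blockSizeFromDigestLength(114));  // -28
    }
}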
> > The main idea of the patch is to pass the `block_size` to differentiate SHA3 instances. > Tests `test/jdk/sun/security/ec/ed/EdDSATest.java` and `./test/jdk/sun/security/provider/MessageDigest/SHA3.java` both passed. > And tier1~3 passed on SHA3 supported hardware. > > The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform. > The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively. > > Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform. > Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression. > These performance details are available in the comments of the issue page. > I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform. This pull request has now been integrated. Changeset: 2f728d0c Author: Dong Bo Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/2f728d0cbb366b98158ca8b2acf4b6f58df2fd52 Stats: 68 lines in 4 files changed: 18 ins; 13 del; 37 mod 8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics Reviewed-by: haosun, aph ------------- PR: https://git.openjdk.org/jdk/pull/10939 From aph at openjdk.org Thu Nov 17 09:43:17 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 17 Nov 2022 09:43:17 GMT Subject: RFR: 8293856: AArch64: Remove clear_inst_mark from aarch64_enc_java_dynamic_call In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 07:08:17 GMT, Hao Sun wrote: > 1) After the fix of JDK-8287394, there is no need for clear_inst_mark after trampoline_call. See the discussion in [1]. > > 2) MacroAssembler::ic_call has trampoline_call as the last call. > > Hence, clear_inst_mark after MacroAssembler::ic_call can be removed. There is such a case in aarch64_enc_java_dynamic_call. We conduct the cleanup in this patch. > > Testing: tier1~3 passed with no new failures on Linux/AArch64 platform. > > [1] https://github.com/openjdk/jdk/pull/8564#discussion_r871062342 Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11200 From haosun at openjdk.org Thu Nov 17 07:23:50 2022 From: haosun at openjdk.org (Hao Sun) Date: Thu, 17 Nov 2022 07:23:50 GMT Subject: RFR: 8293856: AArch64: Remove clear_inst_mark from aarch64_enc_java_dynamic_call Message-ID: 1) After the fix of JDK-8287394, there is no need for clear_inst_mark after trampoline_call. See the discussion in [1]. 2) MacroAssembler::ic_call has trampoline_call as the last call. Hence, clear_inst_mark after MacroAssembler::ic_call can be removed. There is such a case in aarch64_enc_java_dynamic_call. We conduct the cleanup in this patch. Testing: tier1~3 passed with no new failures on Linux/AArch64 platform. 
[1] https://github.com/openjdk/jdk/pull/8564#discussion_r871062342 ------------- Commit messages: - 8293856: AArch64: Remove clear_inst_mark from aarch64_enc_java_dynamic_call Changes: https://git.openjdk.org/jdk/pull/11200/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11200&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8293856 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11200.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11200/head:pull/11200 PR: https://git.openjdk.org/jdk/pull/11200 From vkempik at openjdk.org Thu Nov 17 10:11:47 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 17 Nov 2022 10:11:47 GMT Subject: RFR: 8296602: RISC-V: improve performance of copy_memory stub [v5] In-Reply-To: References: Message-ID: > Please review this change to improve the performance of copy_memory stub on risc-v > > This change has three parts > 1) use copy32 if possible to do 4 ld and 4 st per loop cycle > 2) don't produce precopy code if is_aligned is true, it's not executed. > 3) in the end of loop8 and loop32, remove data dependency between two addi opcodes, to allow them to be scheduled simultaneously > > testing: org.openjdk.bench.vm.compiler.ArrayCopyObject, hotspot_compiler_arraycopy, hotspot:tier1, hotspot:tier2 - all ok > hotspot:tier2 is on the way. > > and for the benchmark results, using > org.openjdk.bench.vm.compiler.ArrayCopyObject.conjoint_micro > > thead rvb-ice c910 > thead > > Before ( copy8 only ) > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6653.095 ? 251.565 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4933.970 ? 77.559 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3627.454 ? 34.589 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 368.249 ? 0.453 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 187.776 ? 0.306 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 94.477 ? 0.340 ops/ms > > after ( with copy32 ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7620.546 ? 69.756 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6677.978 ? 33.112 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5206.973 ? 22.612 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 653.655 ? 31.494 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 352.905 ? 7.390 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 165.127 ? 0.832 ops/ms > > after ( copy32 with dead code elimination and independent addis ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7576.346 ? 94.487 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6475.730 ? 252.590 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5221.764 ? 20.415 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 691.847 ? 1.102 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 360.269 ? 1.091 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 179.733 ? 3.012 ops/ms > > on hifive unmatched: > > before: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 5391.575 ? 152.984 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 3700.946 ? 43.175 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 2316.160 ? 24.734 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 188.616 ? 0.151 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 95.323 ? 0.053 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 46.935 ? 0.041 ops/ms > > after: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6136.169 ? 
330.409 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4924.020 ? 78.529 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3732.561 ? 89.606 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 431.103 ? 0.505 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 221.543 ? 0.363 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 100.586 ? 0.197 ops/ms Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: rename loop labels ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11058/files - new: https://git.openjdk.org/jdk/pull/11058/files/cc91f7b6..ba548675 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11058&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11058&range=03-04 Stats: 13 lines in 1 file changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/11058.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11058/head:pull/11058 PR: https://git.openjdk.org/jdk/pull/11058 From rcastanedalo at openjdk.org Thu Nov 17 10:28:50 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 17 Nov 2022 10:28:50 GMT Subject: RFR: JDK-8297032: IGV: shortcut to center selected nodes [v2] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 15:05:22 GMT, Tobias Holenstein wrote: >> Introduce a new shortcut `CTRL-9`/ `CMD-9` to center the nodes that are currently selected in IGV >> >> ![center_selected_nodes](https://user-images.githubusercontent.com/71546117/201934216-0b65caa2-af62-4083-877b-e5747d5409ee.png) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > make CenterSelectedNodesAction a ModelAwareAction Thanks for addressing my suggestion, looks good! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11167 From vkempik at openjdk.org Thu Nov 17 10:32:43 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 17 Nov 2022 10:32:43 GMT Subject: Integrated: 8296602: RISC-V: improve performance of copy_memory stub In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 11:13:52 GMT, Vladimir Kempik wrote: > Please review this change to improve the performance of copy_memory stub on risc-v > > This change has three parts > 1) use copy32 if possible to do 4 ld and 4 st per loop cycle > 2) don't produce precopy code if is_aligned is true, it's not executed. > 3) in the end of loop8 and loop32, remove data dependency between two addi opcodes, to allow them to be scheduled simultaneously > > testing: org.openjdk.bench.vm.compiler.ArrayCopyObject, hotspot_compiler_arraycopy, hotspot:tier1, hotspot:tier2 - all ok > hotspot:tier2 is on the way. > > and for the benchmark results, using > org.openjdk.bench.vm.compiler.ArrayCopyObject.conjoint_micro > > thead rvb-ice c910 > thead > > Before ( copy8 only ) > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6653.095 ? 251.565 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4933.970 ? 77.559 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3627.454 ? 34.589 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 368.249 ? 0.453 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 187.776 ? 0.306 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 94.477 ? 0.340 ops/ms > > after ( with copy32 ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7620.546 ? 69.756 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6677.978 ? 
33.112 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5206.973 ? 22.612 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 653.655 ? 31.494 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 352.905 ? 7.390 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 165.127 ? 0.832 ops/ms > > after ( copy32 with dead code elimination and independent addis ) > ArrayCopyObject.conjoint_micro 31 thrpt 25 7576.346 ? 94.487 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 6475.730 ? 252.590 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 5221.764 ? 20.415 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 691.847 ? 1.102 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 360.269 ? 1.091 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 179.733 ? 3.012 ops/ms > > on hifive unmatched: > > before: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 5391.575 ? 152.984 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 3700.946 ? 43.175 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 2316.160 ? 24.734 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 188.616 ? 0.151 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 95.323 ? 0.053 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 46.935 ? 0.041 ops/ms > > after: > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 25 6136.169 ? 330.409 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 25 4924.020 ? 78.529 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 25 3732.561 ? 89.606 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 25 431.103 ? 0.505 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 25 221.543 ? 0.363 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 25 100.586 ? 0.197 ops/ms This pull request has now been integrated. Changeset: bd57e213 Author: Vladimir Kempik URL: https://git.openjdk.org/jdk/commit/bd57e2138fc980822a149af905e572ab71ccbf11 Stats: 70 lines in 1 file changed: 41 ins; 1 del; 28 mod 8296602: RISC-V: improve performance of copy_memory stub Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/11058 From eastigeevich at openjdk.org Thu Nov 17 11:07:23 2022 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 17 Nov 2022 11:07:23 GMT Subject: RFR: 8293856: AArch64: Remove clear_inst_mark from aarch64_enc_java_dynamic_call In-Reply-To: References: Message-ID: <7SQ-2A2eCtI8khPxqGi4hVYjOzaJtwF1FUUJWICzsBw=.1deab45f-d35a-4976-8ea7-305da7cb2482@github.com> On Thu, 17 Nov 2022 07:08:17 GMT, Hao Sun wrote: > 1) After the fix of JDK-8287394, there is no need for clear_inst_mark after trampoline_call. See the discussion in [1]. > > 2) MacroAssembler::ic_call has trampoline_call as the last call. > > Hence, clear_inst_mark after MacroAssembler::ic_call can be removed. There is such a case in aarch64_enc_java_dynamic_call. We conduct the cleanup in this patch. > > Testing: tier1~3 passed with no new failures on Linux/AArch64 platform. > > [1] https://github.com/openjdk/jdk/pull/8564#discussion_r871062342 Marked as reviewed by eastigeevich (Committer). 
------------- PR: https://git.openjdk.org/jdk/pull/11200 From tholenstein at openjdk.org Thu Nov 17 13:08:38 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 17 Nov 2022 13:08:38 GMT Subject: Integrated: JDK-8297032: IGV: shortcut to center selected nodes In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 13:39:45 GMT, Tobias Holenstein wrote: > Introduce a new shortcut `CTRL-9`/ `CMD-9` to center the nodes that are currently selected in IGV > > ![center_selected_nodes](https://user-images.githubusercontent.com/71546117/201934216-0b65caa2-af62-4083-877b-e5747d5409ee.png) This pull request has now been integrated. Changeset: d02bfdf9 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/d02bfdf9d7de393b6c76d045b6cd079d7b62a89d Stats: 271 lines in 3 files changed: 269 ins; 0 del; 2 mod 8297032: IGV: shortcut to center selected nodes Reviewed-by: chagedorn, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/11167 From tholenstein at openjdk.org Thu Nov 17 13:08:37 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 17 Nov 2022 13:08:37 GMT Subject: RFR: JDK-8297032: IGV: shortcut to center selected nodes [v2] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 10:25:24 GMT, Roberto Casta?eda Lozano wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> make CenterSelectedNodesAction a ModelAwareAction > > Thanks for addressing my suggestion, looks good! thanks @robcasloz and @chhagedorn for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/11167 From tholenstein at openjdk.org Thu Nov 17 13:11:11 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 17 Nov 2022 13:11:11 GMT Subject: RFR: JDK-8297007: IGV: Link/Unlink node selection of open tabs [v2] In-Reply-To: References: Message-ID: > In IGV graphs can be opened in several tabs and then display them side-by-side. Previously, when the user selected nodes in tab A the selection was also applied in tab B. > > We now introduce a new global button to link and unlink the selection of different tabs. > ![link_button](https://user-images.githubusercontent.com/71546117/201961318-e1263f6c-b3e9-41d5-a1f1-5493d5294bb5.png) > > If the button is **pressed**, the selection is **linked** globally across tabs: > ![linked](https://user-images.githubusercontent.com/71546117/201960953-88f90c74-1c87-4c29-9881-47b55e7c26b9.png) > > If the button is **not pressed**, the selection is **not linked** across tabs. This is the default setting: > ![unlinked](https://user-images.githubusercontent.com/71546117/201961012-f531e7b9-1f23-4584-b207-02529ae25d5a.png) > > # Implementation > The `SelectionCoordinator` is responsible to update the other tabs when the selection changes. We simply disable the `SelectionCoordinator` when the link button is not pressed, and enable it otherwise. 
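The gating described in that last paragraph amounts to a single global flag checked before a selection change is broadcast to other tabs. The sketch below is a minimal stand-alone illustration; the class and method names are invented and are not IGV's actual SelectionCoordinator API. Keeping the flag global rather than per tab matches a single toolbar toggle that applies to every open tab.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SelectionLinkSketch {
    private static boolean globalSelection = false;  // default: selection not linked
    private final List<Consumer<List<Integer>>> otherTabs = new ArrayList<>();

    public static void setGlobalSelection(boolean enabled) {
        globalSelection = enabled;
    }

    public void addTabListener(Consumer<List<Integer>> listener) {
        otherTabs.add(listener);
    }

    // Called by the tab in which the user changed the node selection.
    public void selectionChanged(List<Integer> selectedNodeIds) {
        if (!globalSelection) {
            return;  // unlinked: keep the selection local to this tab
        }
        for (Consumer<List<Integer>> tab : otherTabs) {
            tab.accept(selectedNodeIds);  // linked: propagate to the other tabs
        }
    }

    public static void main(String[] args) {
        SelectionLinkSketch coordinator = new SelectionLinkSketch();
        coordinator.addTabListener(ids -> System.out.println("tab B selects " + ids));
        coordinator.selectionChanged(List.of(42));      // no output: unlinked by default
        setGlobalSelection(true);
        coordinator.selectionChanged(List.of(42, 43));  // prints: tab B selects [42, 43]
    }
}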
Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: - add Shortcuts to GlobalSelectionAction - correct class in all CallableSystemAction - copyright year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11171/files - new: https://git.openjdk.org/jdk/pull/11171/files/f8cbc129..4accab3e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11171&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11171&range=00-01 Stats: 85 lines in 13 files changed: 56 ins; 4 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/11171.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11171/head:pull/11171 PR: https://git.openjdk.org/jdk/pull/11171 From tholenstein at openjdk.org Thu Nov 17 13:11:14 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 17 Nov 2022 13:11:14 GMT Subject: RFR: JDK-8297007: IGV: Link/Unlink node selection of open tabs [v2] In-Reply-To: References: Message-ID: <3mARqqyRkprYmuvksecrJyuYpSchU1Km2jMuOtfoPDI=.e2017396-d41f-42aa-9dbf-1021cdaedd6f@github.com> On Tue, 15 Nov 2022 18:44:28 GMT, Roberto Casta?eda Lozano wrote: >> Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: >> >> - add Shortcuts to GlobalSelectionAction >> - correct class in all CallableSystemAction >> - copyright year > > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/GlobalSelectionAction.java line 2: > >> 1: /* >> 2: * Copyright (c) 2011, 2015, Oracle and/or its affiliates. All rights reserved. > > Set new copyright year. done > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/GlobalSelectionAction.java line 32: > >> 30: import javax.swing.ImageIcon; >> 31: import org.openide.util.ImageUtilities; >> 32: > > Would be nice to assign a shortcut to this action. Good idea. I added `Ctrl-L` (L for link?) as a shortcut. Since this is a global action, the shortcut will never be disabled. ------------- PR: https://git.openjdk.org/jdk/pull/11171 From thartmann at openjdk.org Thu Nov 17 13:22:37 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 17 Nov 2022 13:22:37 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v4] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 03:26:15 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. 
In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > skip IfProj The new version looks reasonable to me but please merge the two tests into one. There is also a jcheck whitespace error. If you don't intend to add an IR verification test for [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) with this fix, please file at least a follow-up enhancement. ------------- PR: https://git.openjdk.org/jdk/pull/9695 From rcastanedalo at openjdk.org Thu Nov 17 13:49:20 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 17 Nov 2022 13:49:20 GMT Subject: RFR: JDK-8297007: IGV: Link/Unlink node selection of open tabs [v2] In-Reply-To: References: Message-ID: <4VoEyNagwuZP2RHwQN4QkDZ1UYbye2juUd2D2Z0jArM=.a01f0468-800f-460d-9aec-24408c21f84e@github.com> On Thu, 17 Nov 2022 13:11:11 GMT, Tobias Holenstein wrote: >> In IGV graphs can be opened in several tabs and then display them side-by-side. Previously, when the user selected nodes in tab A the selection was also applied in tab B. >> >> We now introduce a new global button to link and unlink the selection of different tabs. >> ![link_button](https://user-images.githubusercontent.com/71546117/201961318-e1263f6c-b3e9-41d5-a1f1-5493d5294bb5.png) >> >> If the button is **pressed**, the selection is **linked** globally across tabs: >> ![linked](https://user-images.githubusercontent.com/71546117/201960953-88f90c74-1c87-4c29-9881-47b55e7c26b9.png) >> >> If the button is **not pressed**, the selection is **not linked** across tabs. This is the default setting: >> ![unlinked](https://user-images.githubusercontent.com/71546117/201961012-f531e7b9-1f23-4584-b207-02529ae25d5a.png) >> >> # Implementation >> The `SelectionCoordinator` is responsible to update the other tabs when the selection changes. We simply disable the `SelectionCoordinator` when the link button is not pressed, and enable it otherwise. > > Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: > > - add Shortcuts to GlobalSelectionAction > - correct class in all CallableSystemAction > - copyright year Thanks for addressing my comments, Tobias! ------------- Marked as reviewed by rcastanedalo (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11171 From chagedorn at openjdk.org Thu Nov 17 13:56:12 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 17 Nov 2022 13:56:12 GMT Subject: RFR: JDK-8297007: IGV: Link/Unlink node selection of open tabs [v2] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 13:11:11 GMT, Tobias Holenstein wrote: >> In IGV graphs can be opened in several tabs and then display them side-by-side. Previously, when the user selected nodes in tab A the selection was also applied in tab B. >> >> We now introduce a new global button to link and unlink the selection of different tabs. >> ![link_button](https://user-images.githubusercontent.com/71546117/201961318-e1263f6c-b3e9-41d5-a1f1-5493d5294bb5.png) >> >> If the button is **pressed**, the selection is **linked** globally across tabs: >> ![linked](https://user-images.githubusercontent.com/71546117/201960953-88f90c74-1c87-4c29-9881-47b55e7c26b9.png) >> >> If the button is **not pressed**, the selection is **not linked** across tabs. This is the default setting: >> ![unlinked](https://user-images.githubusercontent.com/71546117/201961012-f531e7b9-1f23-4584-b207-02529ae25d5a.png) >> >> # Implementation >> The `SelectionCoordinator` is responsible to update the other tabs when the selection changes. We simply disable the `SelectionCoordinator` when the link button is not pressed, and enable it otherwise. > > Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: > > - add Shortcuts to GlobalSelectionAction > - correct class in all CallableSystemAction > - copyright year Looks good! Works as expected on Linux. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java line 68: > 66: private boolean showEmptyBlocks; > 67: private boolean hideDuplicates; > 68: private static boolean globalSelection = false; Suggestion: private static boolean globalSelection = false; ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11171 From tholenstein at openjdk.org Thu Nov 17 14:08:27 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 17 Nov 2022 14:08:27 GMT Subject: RFR: JDK-8297007: IGV: Link/Unlink node selection of open tabs [v3] In-Reply-To: References: Message-ID: <8GqNq44f-N9LoKVXRxGwR7-3qJ0Dh4EfwIT53FqD3l0=.ab2693d5-a940-414e-adc9-cb57b808b6ad@github.com> > In IGV graphs can be opened in several tabs and then display them side-by-side. Previously, when the user selected nodes in tab A the selection was also applied in tab B. > > We now introduce a new global button to link and unlink the selection of different tabs. > ![link_button](https://user-images.githubusercontent.com/71546117/201961318-e1263f6c-b3e9-41d5-a1f1-5493d5294bb5.png) > > If the button is **pressed**, the selection is **linked** globally across tabs: > ![linked](https://user-images.githubusercontent.com/71546117/201960953-88f90c74-1c87-4c29-9881-47b55e7c26b9.png) > > If the button is **not pressed**, the selection is **not linked** across tabs. This is the default setting: > ![unlinked](https://user-images.githubusercontent.com/71546117/201961012-f531e7b9-1f23-4584-b207-02529ae25d5a.png) > > # Implementation > The `SelectionCoordinator` is responsible to update the other tabs when the selection changes. We simply disable the `SelectionCoordinator` when the link button is not pressed, and enable it otherwise. 
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: Update src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramViewModel.java whitespace Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11171/files - new: https://git.openjdk.org/jdk/pull/11171/files/4accab3e..5fa5878a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11171&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11171&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11171.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11171/head:pull/11171 PR: https://git.openjdk.org/jdk/pull/11171 From tholenstein at openjdk.org Thu Nov 17 14:08:27 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 17 Nov 2022 14:08:27 GMT Subject: RFR: JDK-8297007: IGV: Link/Unlink node selection of open tabs [v2] In-Reply-To: <4VoEyNagwuZP2RHwQN4QkDZ1UYbye2juUd2D2Z0jArM=.a01f0468-800f-460d-9aec-24408c21f84e@github.com> References: <4VoEyNagwuZP2RHwQN4QkDZ1UYbye2juUd2D2Z0jArM=.a01f0468-800f-460d-9aec-24408c21f84e@github.com> Message-ID: On Thu, 17 Nov 2022 13:47:10 GMT, Roberto Casta?eda Lozano wrote: >> Tobias Holenstein has updated the pull request incrementally with three additional commits since the last revision: >> >> - add Shortcuts to GlobalSelectionAction >> - correct class in all CallableSystemAction >> - copyright year > > Thanks for addressing my comments, Tobias! thank you @robcasloz and @chhagedorn for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/11171 From tholenstein at openjdk.org Thu Nov 17 14:10:24 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 17 Nov 2022 14:10:24 GMT Subject: Integrated: JDK-8297007: IGV: Link/Unlink node selection of open tabs In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 15:37:26 GMT, Tobias Holenstein wrote: > In IGV graphs can be opened in several tabs and then display them side-by-side. Previously, when the user selected nodes in tab A the selection was also applied in tab B. > > We now introduce a new global button to link and unlink the selection of different tabs. > ![link_button](https://user-images.githubusercontent.com/71546117/201961318-e1263f6c-b3e9-41d5-a1f1-5493d5294bb5.png) > > If the button is **pressed**, the selection is **linked** globally across tabs: > ![linked](https://user-images.githubusercontent.com/71546117/201960953-88f90c74-1c87-4c29-9881-47b55e7c26b9.png) > > If the button is **not pressed**, the selection is **not linked** across tabs. This is the default setting: > ![unlinked](https://user-images.githubusercontent.com/71546117/201961012-f531e7b9-1f23-4584-b207-02529ae25d5a.png) > > # Implementation > The `SelectionCoordinator` is responsible to update the other tabs when the selection changes. We simply disable the `SelectionCoordinator` when the link button is not pressed, and enable it otherwise. This pull request has now been integrated. 
Changeset: 4120db13 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/4120db13d48dfbae1aa3c3c9d03229d6ac133c91 Stats: 165 lines in 15 files changed: 127 ins; 6 del; 32 mod 8297007: IGV: Link/Unlink node selection of open tabs Reviewed-by: rcastanedalo, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/11171 From vlivanov at openjdk.org Thu Nov 17 19:36:33 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 17 Nov 2022 19:36:33 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v21] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 03:23:49 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > vzeroall, no spill, reg re-map Overall, looks good. Just one minor cleanup suggestion. I've submitted the latest patch for testing (hs-tier1 - hs-tier4). src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 377: > 375: __ shlq(t0, 40); > 376: __ addq(a1, t0); > 377: if (a2 == noreg) { Please, get rid of early return and turn the check into `if (a2 != noreg) { ... }` which guards the following code. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From mdoerr at openjdk.org Thu Nov 17 19:52:25 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 17 Nov 2022 19:52:25 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v5] In-Reply-To: <7LIFD7nq9mWL6nfRJMz1pwc9h1SZaEnAWlsPT5mG1yI=.cedeccab-8642-4131-a90e-d8fc9d015619@github.com> References: <7LIFD7nq9mWL6nfRJMz1pwc9h1SZaEnAWlsPT5mG1yI=.cedeccab-8642-4131-a90e-d8fc9d015619@github.com> Message-ID: <4kTtXvjuyStJdVRl6nJqWaDZRqnuC77YaBtSU_51zNE=.0461f131-cf15-41a0-b120-3228dbe97969@github.com> On Wed, 16 Nov 2022 10:24:24 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. 
It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). >> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. >> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Cleanup BasicExp.java Great work! Looks very good to me. 
I'm requesting precise node sizes in C2 (let's keep this invariant: nodes should either have a precisely precomputed size or use dynamic computation). The rest are just minor comments and suggestions. I need more time to study all explanations and to review the test, but I can't see anything which prevents us from shipping it with JDK 20. Thanks for the ASCII art pictures! src/hotspot/cpu/aarch64/frame_aarch64.hpp line 110: > 108: // between a callee frame and its stack arguments, where it is part > 109: // of the caller/callee overlap > 110: metadata_words_at_top = 0, I like the `_at_bottom` and `_at_top` versions. The old version could possibly get replaced completely in a separate RFE (this change is already large enough). src/hotspot/cpu/ppc/continuationFreezeThaw_ppc.inline.hpp line 277: > 275: // esp points one slot below the last argument > 276: intptr_t* x86_64_like_unextended_sp = f.interpreter_frame_esp() + 1 - frame::metadata_words_at_top; > 277: sp = fp - (f.fp() - x86_64_like_unextended_sp); Nice workaround for the strange `unextended_sp` which doesn't fit well for other platforms! src/hotspot/cpu/ppc/continuationFreezeThaw_ppc.inline.hpp line 498: > 496: // align fp > 497: int padding = fp - align_down(fp, frame::frame_alignment); > 498: fp -= padding; Additional whitespaces should better get removed. src/hotspot/cpu/ppc/continuationFreezeThaw_ppc.inline.hpp line 520: > 518: > 519: if ((bottom && argsize > 0) || caller.is_interpreted_frame()) { > 520: frame_sp -= argsize + frame::metadata_words_at_top; whitespace src/hotspot/cpu/ppc/frame_ppc.hpp line 370: > 368: > 369: union { > 370: intptr_t* _fp; // frame pointer whitespace src/hotspot/cpu/ppc/frame_ppc.inline.hpp line 113: > 111: // In thaw, non-heap frames use this constructor to pass oop_map. I don't know why. > 112: assert(_on_heap || _cb != nullptr, "these frames are always heap frames"); > 113: if (cb != NULL) { Better use `nullptr`. src/hotspot/cpu/ppc/nativeInst_ppc.cpp line 448: > 446: } > 447: > 448: // Inserts an undefined instruction at a given pc More precisely: An instruction which is specified to cause a SIGILL. src/hotspot/cpu/ppc/ppc.ad line 14386: > 14384: > 14385: format %{ "CALL,static $meth \t// ==> " %} > 14386: size(8); Please make the sizes precise. They are only 8 Byte when continuations are enabled. src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 1803: > 1801: > 1802: // Read interpreter arguments into registers (this is an ad-hoc i2c adapter) > 1803: __ ld(reg_cont_obj, Interpreter::stackElementSize*3, R15_esp); Would you like to align the indentation? src/hotspot/cpu/ppc/stackChunkFrameStream_ppc.inline.hpp line 194: > 192: + 1 // for the mirror oop > 193: + ((intptr_t*)f.interpreter_frame_monitor_begin() > 194: - (intptr_t*)f.interpreter_frame_monitor_end())/BasicObjectLock::size(); whitespace src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp line 634: > 632: } > 633: > 634: __ restore_interpreter_state(R11_scratch1, false /*bcp_and_mdx_only*/, true /*restore_top_frame_sp*/); Nice cleanup! src/hotspot/share/runtime/continuationFreezeThaw.cpp line 488: > 486: assert(!Interpreter::contains(_cont.entryPC()), ""); > 487: static const int doYield_stub_frame_size = NOT_PPC64(frame::metadata_words) > 488: PPC64_ONLY(frame::abi_reg_args_size >> LogBytesPerWord); Unfortunate that we still need to distinguish, here. But, ok. I don't have a better idea atm. 
src/hotspot/share/runtime/continuationFreezeThaw.cpp line 1024: > 1022: if (f.is_interpreted_frame()) { > 1023: assert(hf.is_heap_frame(), "should be"); > 1024: ContinuationHelper::InterpretedFrame::patch_sender_sp(hf, caller); Thanks for removing `unextended_sp`, here. It's not generic enough for shared code. ------------- Changes requested by mdoerr (Reviewer). PR: https://git.openjdk.org/jdk/pull/10961 From duke at openjdk.org Thu Nov 17 20:42:27 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 17 Nov 2022 20:42:27 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v21] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 19:30:14 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> vzeroall, no spill, reg re-map > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 377: > >> 375: __ shlq(t0, 40); >> 376: __ addq(a1, t0); >> 377: if (a2 == noreg) { > > Please, get rid of early return and turn the check into `if (a2 != noreg) { ... }` which guards the following code. done (some golang-ism slipped in.. rewiring habits again) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Thu Nov 17 20:42:27 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Thu, 17 Nov 2022 20:42:27 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v22] In-Reply-To: References: Message-ID: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. > > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 
56.147 ops/s Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: remove early return ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10582/files - new: https://git.openjdk.org/jdk/pull/10582/files/56aed9b1..08ea45e5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10582&range=20-21 Stats: 29 lines in 1 file changed: 13 ins; 14 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10582.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582 PR: https://git.openjdk.org/jdk/pull/10582 From haosun at openjdk.org Fri Nov 18 07:19:14 2022 From: haosun at openjdk.org (Hao Sun) Date: Fri, 18 Nov 2022 07:19:14 GMT Subject: RFR: 8293856: AArch64: Remove clear_inst_mark from aarch64_enc_java_dynamic_call In-Reply-To: References: Message-ID: <-0YMkfrDjOYHIDZZKLk0Ni645IJNBsgzUCOkaUs8HaY=.df84e1f7-24b7-44f7-a7ba-575df0c35f45@github.com> On Thu, 17 Nov 2022 07:08:17 GMT, Hao Sun wrote: > 1) After the fix of JDK-8287394, there is no need for clear_inst_mark after trampoline_call. See the discussion in [1]. > > 2) MacroAssembler::ic_call has trampoline_call as the last call. > > Hence, clear_inst_mark after MacroAssembler::ic_call can be removed. There is such a case in aarch64_enc_java_dynamic_call. We conduct the cleanup in this patch. > > Testing: tier1~3 passed with no new failures on Linux/AArch64 platform. > > [1] https://github.com/openjdk/jdk/pull/8564#discussion_r871062342 Thanks for your reviews. ------------- PR: https://git.openjdk.org/jdk/pull/11200 From roland at openjdk.org Fri Nov 18 08:45:25 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 18 Nov 2022 08:45:25 GMT Subject: RFR: 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" [v2] In-Reply-To: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> References: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> Message-ID: > This failure is similar to previous failures with loop strip mining: a > node is encountered that has control set in the outer strip mined loop > but is not reachable from the safepoint. There's already logic in loop > cloning to find those and fix their control to be outside the > loop. Usually a node ends up in the outer loop because some of its > inputs is in the outer loop. The current logic to catch nodes that are > erroneously assigned control in the outer loop is to start from > safepoint's inputs and look for uses with incorrect control. That > doesn't work in this case because: 1) the node is created by > IdealLoopTree::reassociate in the outer loop because its inputs are > indeed there 2) but a pass of split if updates the control to be > inside the inner loop. > > To fix this, I propose reusing the existing clone_outer_loop_helper() > but apply it to the loop body as well. I had to tweak that method > because I ran into cases of dead nodes still reachable from a node in > the loop body but removed from the _body list by > IdealLoopTree::DCE_loop_body() (and as a result not cloned). 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/loopstripmining/TestUseFromInnerInOuterUnusedBySfpt.java Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11162/files - new: https://git.openjdk.org/jdk/pull/11162/files/32271652..b824a622 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11162&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11162&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11162.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11162/head:pull/11162 PR: https://git.openjdk.org/jdk/pull/11162 From haosun at openjdk.org Fri Nov 18 09:04:27 2022 From: haosun at openjdk.org (Hao Sun) Date: Fri, 18 Nov 2022 09:04:27 GMT Subject: Integrated: 8293856: AArch64: Remove clear_inst_mark from aarch64_enc_java_dynamic_call In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 07:08:17 GMT, Hao Sun wrote: > 1) After the fix of JDK-8287394, there is no need for clear_inst_mark after trampoline_call. See the discussion in [1]. > > 2) MacroAssembler::ic_call has trampoline_call as the last call. > > Hence, clear_inst_mark after MacroAssembler::ic_call can be removed. There is such a case in aarch64_enc_java_dynamic_call. We conduct the cleanup in this patch. > > Testing: tier1~3 passed with no new failures on Linux/AArch64 platform. > > [1] https://github.com/openjdk/jdk/pull/8564#discussion_r871062342 This pull request has now been integrated. Changeset: 2b6dbc71 Author: Hao Sun Committer: Ningsheng Jian URL: https://git.openjdk.org/jdk/commit/2b6dbc71d8ad2843d3871c7d042313cd71d6d700 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod 8293856: AArch64: Remove clear_inst_mark from aarch64_enc_java_dynamic_call Reviewed-by: aph, eastigeevich ------------- PR: https://git.openjdk.org/jdk/pull/11200 From roland at openjdk.org Fri Nov 18 09:10:09 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 18 Nov 2022 09:10:09 GMT Subject: RFR: 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" [v3] In-Reply-To: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> References: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> Message-ID: > This failure is similar to previous failures with loop strip mining: a > node is encountered that has control set in the outer strip mined loop > but is not reachable from the safepoint. There's already logic in loop > cloning to find those and fix their control to be outside the > loop. Usually a node ends up in the outer loop because some of its > inputs is in the outer loop. The current logic to catch nodes that are > erroneously assigned control in the outer loop is to start from > safepoint's inputs and look for uses with incorrect control. That > doesn't work in this case because: 1) the node is created by > IdealLoopTree::reassociate in the outer loop because its inputs are > indeed there 2) but a pass of split if updates the control to be > inside the inner loop. > > To fix this, I propose reusing the existing clone_outer_loop_helper() > but apply it to the loop body as well. 
I had to tweak that method > because I ran into cases of dead nodes still reachable from a node in > the loop body but removed from the _body list by > IdealLoopTree::DCE_loop_body() (and as a result not cloned). Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: more ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11162/files - new: https://git.openjdk.org/jdk/pull/11162/files/b824a622..c35cbf42 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11162&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11162&range=01-02 Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/11162.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11162/head:pull/11162 PR: https://git.openjdk.org/jdk/pull/11162 From roland at openjdk.org Fri Nov 18 09:10:11 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 18 Nov 2022 09:10:11 GMT Subject: RFR: 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" [v3] In-Reply-To: References: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> Message-ID: On Wed, 16 Nov 2022 06:17:53 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> more > > src/hotspot/share/opto/loopopts.cpp line 2304: > >> 2302: for (uint i = 0; i < loop->_body.size(); i++) { >> 2303: Node* old = loop->_body.at(i); >> 2304: clone_outer_loop_helper(old, loop, outer_loop, old_new, wq, this, true); > > While you're at it, could you rename the helper method to something more meaningful? Thanks for reviewing this. I did in the new commit. Does it look good to you? ------------- PR: https://git.openjdk.org/jdk/pull/11162 From thartmann at openjdk.org Fri Nov 18 10:02:23 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 18 Nov 2022 10:02:23 GMT Subject: RFR: 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" [v3] In-Reply-To: References: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> Message-ID: On Fri, 18 Nov 2022 09:10:09 GMT, Roland Westrelin wrote: >> This failure is similar to previous failures with loop strip mining: a >> node is encountered that has control set in the outer strip mined loop >> but is not reachable from the safepoint. There's already logic in loop >> cloning to find those and fix their control to be outside the >> loop. Usually a node ends up in the outer loop because some of its >> inputs is in the outer loop. The current logic to catch nodes that are >> erroneously assigned control in the outer loop is to start from >> safepoint's inputs and look for uses with incorrect control. That >> doesn't work in this case because: 1) the node is created by >> IdealLoopTree::reassociate in the outer loop because its inputs are >> indeed there 2) but a pass of split if updates the control to be >> inside the inner loop. >> >> To fix this, I propose reusing the existing clone_outer_loop_helper() >> but apply it to the loop body as well. I had to tweak that method >> because I ran into cases of dead nodes still reachable from a node in >> the loop body but removed from the _body list by >> IdealLoopTree::DCE_loop_body() (and as a result not cloned). 
> > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > more Looks good, thanks for updating! ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11162 From roland at openjdk.org Fri Nov 18 10:04:22 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 18 Nov 2022 10:04:22 GMT Subject: RFR: 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" [v3] In-Reply-To: References: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> Message-ID: On Tue, 15 Nov 2022 20:30:43 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> more > > Make sense. @vnkozlov @TobiHartmann thanks for the reviews. ------------- PR: https://git.openjdk.org/jdk/pull/11162 From bkilambi at openjdk.org Fri Nov 18 10:21:32 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 18 Nov 2022 10:21:32 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v4] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 09:37:53 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Removed svesha3 feature check for eor3 @turbanoff Hello, I have made the changes you've suggested plus some more changes regarding the feature detection for svesha3 in the latest patch. Could you please take a look? Thank you in advance .. ------------- PR: https://git.openjdk.org/jdk/pull/10407 From aturbanov at openjdk.org Fri Nov 18 10:32:21 2022 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Fri, 18 Nov 2022 10:32:21 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v4] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 09:37:53 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. 
This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Removed svesha3 feature check for eor3 Marked as reviewed by aturbanov (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/10407 From ngasson at openjdk.org Fri Nov 18 10:38:21 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Fri, 18 Nov 2022 10:38:21 GMT Subject: RFR: 8295276: AArch64: Add backend support for half float conversion intrinsics In-Reply-To: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> References: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> Message-ID: On Thu, 20 Oct 2022 14:33:33 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend support for library intrinsics that implement conversions between half-precision and single-precision floats. > > Ran the following benchmarks to assess the performance with this patch - > > org.openjdk.bench.java.math.Fp16ConversionBenchmark.floatToFloat16 org.openjdk.bench.java.math.Fp16ConversionBenchmark.float16ToFloat > > The performance (ops/ms) gain with the patch on an ARM NEON machine is shown below - > > > Benchmark Gain > Fp16ConversionBenchmark.float16ToFloat 3.42 > Fp16ConversionBenchmark.floatToFloat16 5.85 Looks OK to me but needs another review. ------------- Marked as reviewed by ngasson (Reviewer). PR: https://git.openjdk.org/jdk/pull/10796 From rrich at openjdk.org Fri Nov 18 11:08:34 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 18 Nov 2022 11:08:34 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v6] In-Reply-To: References: Message-ID: > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. > > The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. 
> 
> 
> X86 / AARCH64                          PPC64:
> 
>   :                 :                  :                 :
>   :                 :                  :                 :
>   |                 |                  |                 |
>   |-----------------|                  |-----------------|
>   |                 |                  |                 |
>   | stack arguments |                  | stack arguments |
>   |                 |<- callers_SP     |                 |
>   ===================                  |-----------------|
>   |                 |                  |                 |
>   | metadata at bottom |               | metadata at top |
>   |                 |                  |                 |<- callers_SP
>   |-----------------|                  ===================
>   |                 |                  |                 |
>   |                 |                  |                 |
>   |                 |                  |                 |
>   |                 |                  |                 |
>   |                 |<- SP             |                 |
>   ===================                  |-----------------|
>                                        |                 |
>                                        | metadata at top |
>                                        |                 |<- SP
>                                        ===================
> 
> 
> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`.
> 
> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`:
>   `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top`
> 
> * address of stack arguments:
>   `callers_SP + frame::metadata_words_at_top`
> 
> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words.
> 
> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter.
> 
> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know whether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as I found it very useful, it increases the runtime quite a bit though.
> 
> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64.
> 
> Thanks, Richard.
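As a concrete illustration of the size formula quoted above (the frame sizes are made up for the example; only the platform constants come from the description): freezing a frame of 14 words that passes 2 words of stack arguments needs 14 + 2 + frame::metadata_words_at_top words in the StackChunk, i.e. 16 words on X86/AARCH64 (constant 0) and 20 words on PPC64 (constant 4). Likewise, that frame's stack arguments start right at callers_SP on X86/AARCH64 but 4 words above callers_SP on PPC64.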
Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Feedback Martin ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10961/files - new: https://git.openjdk.org/jdk/pull/10961/files/7276a8ec..116839ee Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=04-05 Stats: 13 lines in 7 files changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From mdoerr at openjdk.org Fri Nov 18 11:17:35 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 18 Nov 2022 11:17:35 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v6] In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 11:08:34 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). >> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. >> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. 
One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Feedback Martin Thanks for the updates! I'll look more into the explanations and the test, but you don't have to wait for that after you got your 2nd review. ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Fri Nov 18 11:17:38 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 18 Nov 2022 11:17:38 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v5] In-Reply-To: <4kTtXvjuyStJdVRl6nJqWaDZRqnuC77YaBtSU_51zNE=.0461f131-cf15-41a0-b120-3228dbe97969@github.com> References: <7LIFD7nq9mWL6nfRJMz1pwc9h1SZaEnAWlsPT5mG1yI=.cedeccab-8642-4131-a90e-d8fc9d015619@github.com> <4kTtXvjuyStJdVRl6nJqWaDZRqnuC77YaBtSU_51zNE=.0461f131-cf15-41a0-b120-3228dbe97969@github.com> Message-ID: On Thu, 17 Nov 2022 17:48:12 GMT, Martin Doerr wrote: >> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: >> >> Cleanup BasicExp.java > > src/hotspot/cpu/aarch64/frame_aarch64.hpp line 110: > >> 108: // between a callee frame and its stack arguments, where it is part >> 109: // of the caller/callee overlap >> 110: metadata_words_at_top = 0, > > I like the `_at_bottom` and `_at_top` versions. The old version could possibly get replaced completely in a separate RFE (this change is already large enough). Thanks. I don't think though that these can replace metadata_words completely because for at least one platform you would replace it with 0. > src/hotspot/cpu/ppc/continuationFreezeThaw_ppc.inline.hpp line 498: > >> 496: // align fp >> 497: int padding = fp - align_down(fp, frame::frame_alignment); >> 498: fp -= padding; > > Additional whitespaces should better get removed. Done > src/hotspot/cpu/ppc/continuationFreezeThaw_ppc.inline.hpp line 520: > >> 518: >> 519: if ((bottom && argsize > 0) || caller.is_interpreted_frame()) { >> 520: frame_sp -= argsize + frame::metadata_words_at_top; > > whitespace Done > src/hotspot/cpu/ppc/frame_ppc.hpp line 370: > >> 368: >> 369: union { >> 370: intptr_t* _fp; // frame pointer > > whitespace Done. > src/hotspot/cpu/ppc/frame_ppc.inline.hpp line 113: > >> 111: // In thaw, non-heap frames use this constructor to pass oop_map. I don't know why. >> 112: assert(_on_heap || _cb != nullptr, "these frames are always heap frames"); >> 113: if (cb != NULL) { > > Better use `nullptr`. 
Done > src/hotspot/cpu/ppc/nativeInst_ppc.cpp line 448: > >> 446: } >> 447: >> 448: // Inserts an undefined instruction at a given pc > > More precisely: An instruction which is specified to cause a SIGILL. Done > src/hotspot/cpu/ppc/ppc.ad line 14386: > >> 14384: >> 14385: format %{ "CALL,static $meth \t// ==> " %} >> 14386: size(8); > > Please make the sizes precise. They are only 8 Byte when continuations are enabled. Done (note it's not done on other platforms). > src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 1803: > >> 1801: >> 1802: // Read interpreter arguments into registers (this is an ad-hoc i2c adapter) >> 1803: __ ld(reg_cont_obj, Interpreter::stackElementSize*3, R15_esp); > > Would you like to align the indentation? Done > src/hotspot/cpu/ppc/stackChunkFrameStream_ppc.inline.hpp line 194: > >> 192: + 1 // for the mirror oop >> 193: + ((intptr_t*)f.interpreter_frame_monitor_begin() >> 194: - (intptr_t*)f.interpreter_frame_monitor_end())/BasicObjectLock::size(); > > whitespace Ok now? > src/hotspot/share/runtime/continuationFreezeThaw.cpp line 488: > >> 486: assert(!Interpreter::contains(_cont.entryPC()), ""); >> 487: static const int doYield_stub_frame_size = NOT_PPC64(frame::metadata_words) >> 488: PPC64_ONLY(frame::abi_reg_args_size >> LogBytesPerWord); > > Unfortunate that we still need to distinguish, here. But, ok. I don't have a better idea atm. A pd constant could be introduced. Not sure if it's worth it. ------------- PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Fri Nov 18 11:21:19 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 18 Nov 2022 11:21:19 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v6] In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 11:15:10 GMT, Martin Doerr wrote: > Thanks for the updates! I'll look more into the explanations and the test, but you don't have to wait for that after you got your 2nd review. Thanks a lot indeed! ------------- PR: https://git.openjdk.org/jdk/pull/10961 From gcao at openjdk.org Fri Nov 18 13:22:11 2022 From: gcao at openjdk.org (Gui Cao) Date: Fri, 18 Nov 2022 13:22:11 GMT Subject: RFR: 8297238: RISC-V: Unifying predicates for vector type matching in c2 Message-ID: Hi, In the vector type predicate matching process of riscv, n->bottom_type()->is_vect()->element_basic_type() is used in some places to get the data type, and Matcher::vector_element_basic_type(n) is used in some places to get the data type, In fact, Matcher::vector_element_basic_type(n) is the function encapsulation form of n->bottom_type()->is_vect()->element_basic_type(), here Matcher::vector_element_basic_type(n) is used uniformly to get the data type Please take a look and have some reviews. Thanks a lot. 
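To illustrate the pattern (a sketch only, not code quoted from the patch; T_FLOAT is just an example element type), a predicate that currently reads

    predicate(n->bottom_type()->is_vect()->element_basic_type() == T_FLOAT);

is uniformly written as

    predicate(Matcher::vector_element_basic_type(n) == T_FLOAT);

Both expressions yield the same element BasicType for the vector node; the patch only makes the Matcher wrapper the single way to query it.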
## Testing: - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) ------------- Commit messages: - Unifying predicates for vector type matching in c2 Changes: https://git.openjdk.org/jdk/pull/11239/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11239&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297238 Stats: 18 lines in 1 file changed: 0 ins; 0 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/11239.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11239/head:pull/11239 PR: https://git.openjdk.org/jdk/pull/11239 From roland at openjdk.org Fri Nov 18 13:50:26 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 18 Nov 2022 13:50:26 GMT Subject: Integrated: 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" In-Reply-To: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> References: <_rSAe8s5BhWa4Cgo6uwk57tkXBVZLqoTOpJvvTQ-toQ=.6bbebdf8-28e6-4c7b-b751-59c05ef0aab8@github.com> Message-ID: <55Bd1-FXzg8QaYWpG6OQh7qj5KMojnFd-FsxYNZtni0=.9e21cbcf-fe14-461f-b3b3-e762cb1fef3a@github.com> On Tue, 15 Nov 2022 11:42:00 GMT, Roland Westrelin wrote: > This failure is similar to previous failures with loop strip mining: a > node is encountered that has control set in the outer strip mined loop > but is not reachable from the safepoint. There's already logic in loop > cloning to find those and fix their control to be outside the > loop. Usually a node ends up in the outer loop because some of its > inputs is in the outer loop. The current logic to catch nodes that are > erroneously assigned control in the outer loop is to start from > safepoint's inputs and look for uses with incorrect control. That > doesn't work in this case because: 1) the node is created by > IdealLoopTree::reassociate in the outer loop because its inputs are > indeed there 2) but a pass of split if updates the control to be > inside the inner loop. > > To fix this, I propose reusing the existing clone_outer_loop_helper() > but apply it to the loop body as well. I had to tweak that method > because I ran into cases of dead nodes still reachable from a node in > the loop body but removed from the _body list by > IdealLoopTree::DCE_loop_body() (and as a result not cloned). This pull request has now been integrated. Changeset: 761a4f48 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/761a4f4852cbb40660b6fb9eda4d740464218f75 Stats: 85 lines in 2 files changed: 68 ins; 0 del; 17 mod 8295788: C2 compilation hits "assert((mode == ControlAroundStripMined && use == sfpt) || !use->is_reachable_from_root()) failed: missed a node" Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11162 From coleenp at openjdk.org Fri Nov 18 16:58:47 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 18 Nov 2022 16:58:47 GMT Subject: RFR: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods Message-ID: I fixed the code to also include is_unloading nmethods. I couldn't write a dedicated test for this. Tested with tier1-4,6. Also tried to reproduce another redefinition bug with this change, which didn't reproduce, but not caused by this change. 
------------- Commit messages: - 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods Changes: https://git.openjdk.org/jdk/pull/11243/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11243&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8293584 Stats: 5 lines in 1 file changed: 0 ins; 4 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11243.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11243/head:pull/11243 PR: https://git.openjdk.org/jdk/pull/11243 From aph at openjdk.org Fri Nov 18 17:15:15 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 18 Nov 2022 17:15:15 GMT Subject: RFR: 8295276: AArch64: Add backend support for half float conversion intrinsics In-Reply-To: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> References: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> Message-ID: On Thu, 20 Oct 2022 14:33:33 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend support for library intrinsics that implement conversions between half-precision and single-precision floats. > > Ran the following benchmarks to assess the performance with this patch - > > org.openjdk.bench.java.math.Fp16ConversionBenchmark.floatToFloat16 org.openjdk.bench.java.math.Fp16ConversionBenchmark.float16ToFloat > > The performance (ops/ms) gain with the patch on an ARM NEON machine is shown below - > > > Benchmark Gain > Fp16ConversionBenchmark.float16ToFloat 3.42 > Fp16ConversionBenchmark.floatToFloat16 5.85 Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10796 From eosterlund at openjdk.org Fri Nov 18 17:29:19 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Fri, 18 Nov 2022 17:29:19 GMT Subject: RFR: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 16:51:45 GMT, Coleen Phillimore wrote: > I fixed the code to also include is_unloading nmethods. I couldn't write a dedicated test for this. > Tested with tier1-4,6. Also tried to reproduce another redefinition bug with this change, which didn't reproduce, but not caused by this change. Looks good. ------------- Marked as reviewed by eosterlund (Reviewer). PR: https://git.openjdk.org/jdk/pull/11243 From kvn at openjdk.org Fri Nov 18 17:54:14 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 18 Nov 2022 17:54:14 GMT Subject: RFR: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 16:51:45 GMT, Coleen Phillimore wrote: > I fixed the code to also include is_unloading nmethods. I couldn't write a dedicated test for this. > Tested with tier1-4,6. Also tried to reproduce another redefinition bug with this change, which didn't reproduce, but not caused by this change. Would be nice to have a comment here to explain why `is_unloaded` nmethods are included in scan. ------------- PR: https://git.openjdk.org/jdk/pull/11243 From vlivanov at openjdk.org Fri Nov 18 22:39:34 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 18 Nov 2022 22:39:34 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v4] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 08:55:00 GMT, Roland Westrelin wrote: > I removed ciArrayKlass::interfaces() in the new commit. Is the resulting code what you suggested? Yes, it is much clearer now. Thanks. 
------------- PR: https://git.openjdk.org/jdk/pull/10901 From coleenp at openjdk.org Fri Nov 18 22:40:40 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 18 Nov 2022 22:40:40 GMT Subject: RFR: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods [v2] In-Reply-To: References: Message-ID: <0CvVKqEi5uyiVY2hubrmmGe8kVJV-Lc5Zc8bKMmGza4=.c6c70cca-52d9-4723-9992-2ba57473f014@github.com> > I fixed the code to also include is_unloading nmethods. I couldn't write a dedicated test for this. > Tested with tier1-4,6. Also tried to reproduce another redefinition bug with this change, which didn't reproduce, but not caused by this change. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Add a comment. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11243/files - new: https://git.openjdk.org/jdk/pull/11243/files/de0a72c9..91dc37be Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11243&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11243&range=00-01 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11243.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11243/head:pull/11243 PR: https://git.openjdk.org/jdk/pull/11243 From coleenp at openjdk.org Fri Nov 18 22:40:40 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 18 Nov 2022 22:40:40 GMT Subject: RFR: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods [v2] In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 17:25:34 GMT, Erik ?sterlund wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Add a comment. > > Looks good. Thanks @fisk. I also added a comment @vnkozlov. Please review for accuracy. ------------- PR: https://git.openjdk.org/jdk/pull/11243 From kvn at openjdk.org Sat Nov 19 00:04:30 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 19 Nov 2022 00:04:30 GMT Subject: RFR: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods [v2] In-Reply-To: <0CvVKqEi5uyiVY2hubrmmGe8kVJV-Lc5Zc8bKMmGza4=.c6c70cca-52d9-4723-9992-2ba57473f014@github.com> References: <0CvVKqEi5uyiVY2hubrmmGe8kVJV-Lc5Zc8bKMmGza4=.c6c70cca-52d9-4723-9992-2ba57473f014@github.com> Message-ID: On Fri, 18 Nov 2022 22:40:40 GMT, Coleen Phillimore wrote: >> I fixed the code to also include is_unloading nmethods. I couldn't write a dedicated test for this. >> Tested with tier1-4,6. Also tried to reproduce another redefinition bug with this change, which didn't reproduce, but not caused by this change. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Add a comment. Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11243 From kvn at openjdk.org Sat Nov 19 00:05:28 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 19 Nov 2022 00:05:28 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v5] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 08:58:14 GMT, Roland Westrelin wrote: >> This change is mostly the same I sent for review 3 years ago but was >> never integrated: >> >> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html >> >> The main difference is that, in the meantime, I submitted a couple of >> refactoring changes extracted from the 2019 patch: >> >> 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr >> 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses >> >> As a result, the current patch is much smaller (but still not small). >> >> The implementation is otherwise largely the same as in the 2019 >> patch. I tried to remove some of the code duplication between the >> TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic >> shared in template methods. In the 2019 patch, interfaces were trusted >> when types were constructed and I had added code to drop interfaces >> from a type where they couldn't be trusted. This new patch proceeds >> the other way around: interfaces are not trusted when a type is >> constructed and code that uses the type must explicitly request that >> they are included (this was suggested as an improvement by Vladimir >> Ivanov I think). > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Nice work. Thanks! ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/10901 From dzhang at openjdk.org Sat Nov 19 01:47:39 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Sat, 19 Nov 2022 01:47:39 GMT Subject: RFR: 8297238: RISC-V: Unifying predicates for vector type matching in c2 In-Reply-To: References: Message-ID: <_1Jfwb3dhS-ruq-D5mQfaLP8827AM7KnC6q3syL3tTE=.26774736-8ce8-466f-850b-0089b8818439@github.com> On Fri, 18 Nov 2022 13:12:54 GMT, Gui Cao wrote: > Hi, > > In the vector type predicate matching process of riscv, n->bottom_type()->is_vect()->element_basic_type() is used in some places to get the data type, and Matcher::vector_element_basic_type(n) is used in some places to get the data type, In fact, Matcher::vector_element_basic_type(n) is the function encapsulation form of n->bottom_type()->is_vect()->element_basic_type(), here Matcher::vector_element_basic_type(n) is used uniformly to get the data type. > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) > - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) LGTM ------------- Marked as reviewed by dzhang (Author). 
PR: https://git.openjdk.org/jdk/pull/11239 From fyang at openjdk.org Sat Nov 19 08:26:17 2022 From: fyang at openjdk.org (Fei Yang) Date: Sat, 19 Nov 2022 08:26:17 GMT Subject: RFR: 8297238: RISC-V: Unifying predicates for vector type matching in c2 In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 13:12:54 GMT, Gui Cao wrote: > Hi, > > In the vector type predicate matching process of riscv, n->bottom_type()->is_vect()->element_basic_type() is used in some places to get the data type, and Matcher::vector_element_basic_type(n) is used in some places to get the data type, In fact, Matcher::vector_element_basic_type(n) is the function encapsulation form of n->bottom_type()->is_vect()->element_basic_type(), here Matcher::vector_element_basic_type(n) is used uniformly to get the data type. > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) > - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) Looks good. Thanks for the cleanup. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/11239 From fyang at openjdk.org Sat Nov 19 08:39:32 2022 From: fyang at openjdk.org (Fei Yang) Date: Sat, 19 Nov 2022 08:39:32 GMT Subject: RFR: 8297238: RISC-V: Unifying predicates for vector type matching in c2 In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 13:12:54 GMT, Gui Cao wrote: > Hi, > > In the vector type predicate matching process of riscv, n->bottom_type()->is_vect()->element_basic_type() is used in some places to get the data type, and Matcher::vector_element_basic_type(n) is used in some places to get the data type, In fact, Matcher::vector_element_basic_type(n) is the function encapsulation form of n->bottom_type()->is_vect()->element_basic_type(), here Matcher::vector_element_basic_type(n) is used uniformly to get the data type. > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) > - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) PS: I think it should be more specific to change the title of the JBS issue to something like: "RISC-V: C2: Use Matcher::vector_element_basic_type when checking for vector element type in predicate" ------------- PR: https://git.openjdk.org/jdk/pull/11239 From gcao at openjdk.org Sat Nov 19 08:49:12 2022 From: gcao at openjdk.org (Gui Cao) Date: Sat, 19 Nov 2022 08:49:12 GMT Subject: RFR: 8297238: RISC-V: C2: Use Matcher::vector_element_basic_type when checking for vector element type in predicate In-Reply-To: References: Message-ID: <7hXy_yJvW9jeiS7sU_1vyBkRut7ktsZPQfqzUXMbh3c=.54ddd227-7fa6-47e9-8ba1-48d381f058cd@github.com> On Sat, 19 Nov 2022 08:36:05 GMT, Fei Yang wrote: >> Hi, >> >> In the vector type predicate matching process of riscv, n->bottom_type()->is_vect()->element_basic_type() is used in some places to get the data type, and Matcher::vector_element_basic_type(n) is used in some places to get the data type, In fact, Matcher::vector_element_basic_type(n) is the function encapsulation form of n->bottom_type()->is_vect()->element_basic_type(), here Matcher::vector_element_basic_type(n) is used uniformly to get the data type. >> >> Please take a look and have some reviews. Thanks a lot. 
>> >> ## Testing: >> - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) >> - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) > > PS: I think it should be more specific to change the title of the JBS issue to something like: "RISC-V: C2: Use Matcher::vector_element_basic_type when checking for vector element type in predicate" @RealFYang @DingliZhang Thanks for the review. > PS: I think it should be more specific to change the title of the JBS issue to something like: "RISC-V: C2: Use Matcher::vector_element_basic_type when checking for vector element type in predicate" Thanks, done. ------------- PR: https://git.openjdk.org/jdk/pull/11239 From njian at openjdk.org Mon Nov 21 02:08:38 2022 From: njian at openjdk.org (Ningsheng Jian) Date: Mon, 21 Nov 2022 02:08:38 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v4] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 09:37:53 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Removed svesha3 feature check for eor3 Marked as reviewed by njian (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/10407 From eliu at openjdk.org Mon Nov 21 02:08:38 2022 From: eliu at openjdk.org (Eric Liu) Date: Mon, 21 Nov 2022 02:08:38 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v4] In-Reply-To: References: Message-ID: <3Jp_EvePVgJJqhiwIh5_U2E2alw4sJf72ArNuWnQr90=.09a499da-8ad9-4176-a679-44e410a5efa9@github.com> On Mon, 14 Nov 2022 09:37:53 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. 
Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Removed svesha3 feature check for eor3 Marked as reviewed by eliu (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/10407 From njian at openjdk.org Mon Nov 21 02:15:49 2022 From: njian at openjdk.org (Ningsheng Jian) Date: Mon, 21 Nov 2022 02:15:49 GMT Subject: RFR: 8295276: AArch64: Add backend support for half float conversion intrinsics In-Reply-To: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> References: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> Message-ID: <-1IScFkj5fAuey5hREsKQtZVCemc6HwgZ4N3cF-HT6Y=.edc004a0-ca2b-4c0f-a614-2b20e7657dc0@github.com> On Thu, 20 Oct 2022 14:33:33 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend support for library intrinsics that implement conversions between half-precision and single-precision floats. > > Ran the following benchmarks to assess the performance with this patch - > > org.openjdk.bench.java.math.Fp16ConversionBenchmark.floatToFloat16 org.openjdk.bench.java.math.Fp16ConversionBenchmark.float16ToFloat > > The performance (ops/ms) gain with the patch on an ARM NEON machine is shown below - > > > Benchmark Gain > Fp16ConversionBenchmark.float16ToFloat 3.42 > Fp16ConversionBenchmark.floatToFloat16 5.85 Looks good to me. ------------- Marked as reviewed by njian (Committer). PR: https://git.openjdk.org/jdk/pull/10796 From yyang at openjdk.org Mon Nov 21 02:22:15 2022 From: yyang at openjdk.org (Yi Yang) Date: Mon, 21 Nov 2022 02:22:15 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v5] In-Reply-To: References: Message-ID: > Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. 
In the following program: > > class Test { > static int dontInline() { > return 0; > } > > static long test(int val, boolean b) { > long ret = 0; > long dArr[] = new long[100]; > for (int i = 15; 293 > i; ++i) { > ret = val; > int j = 1; > while (++j < 6) { > int k = (val--); > for (long l = i; 1 > l; ) { > if (k != 0) { > ret += dontInline(); > } > } > if (b) { > break; > } > } > } > return ret; > } > > public static void main(String[] args) { > for (int i = 0; i < 1000; i++) { > test(0, false); > } > } > } > > `val` is incorrectly matched with the new parallel IV form: > ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) > And C2 further replaces it with newly added nodes, which finally leads the crash: > ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) > > I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) Yi Yang has updated the pull request incrementally with one additional commit since the last revision: one test only ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9695/files - new: https://git.openjdk.org/jdk/pull/9695/files/4cce45e5..55b236d5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9695&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9695&range=03-04 Stats: 90 lines in 2 files changed: 25 ins; 65 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9695.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9695/head:pull/9695 PR: https://git.openjdk.org/jdk/pull/9695 From yyang at openjdk.org Mon Nov 21 02:31:34 2022 From: yyang at openjdk.org (Yi Yang) Date: Mon, 21 Nov 2022 02:31:34 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: > Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. In the following program: > > class Test { > static int dontInline() { > return 0; > } > > static long test(int val, boolean b) { > long ret = 0; > long dArr[] = new long[100]; > for (int i = 15; 293 > i; ++i) { > ret = val; > int j = 1; > while (++j < 6) { > int k = (val--); > for (long l = i; 1 > l; ) { > if (k != 0) { > ret += dontInline(); > } > } > if (b) { > break; > } > } > } > return ret; > } > > public static void main(String[] args) { > for (int i = 0; i < 1000; i++) { > test(0, false); > } > } > } > > `val` is incorrectly matched with the new parallel IV form: > ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) > And C2 further replaces it with newly added nodes, which finally leads the crash: > ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) > > I think we can add more constraints to the new form. 
The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) Yi Yang has updated the pull request incrementally with one additional commit since the last revision: whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9695/files - new: https://git.openjdk.org/jdk/pull/9695/files/55b236d5..7bfc6cc0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9695&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9695&range=04-05 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9695.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9695/head:pull/9695 PR: https://git.openjdk.org/jdk/pull/9695 From yyang at openjdk.org Mon Nov 21 02:31:34 2022 From: yyang at openjdk.org (Yi Yang) Date: Mon, 21 Nov 2022 02:31:34 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v5] In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 02:22:15 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > one test only I filed adding IR verification test as https://bugs.openjdk.org/browse/JDK-8297307. ------------- PR: https://git.openjdk.org/jdk/pull/9695 From thartmann at openjdk.org Mon Nov 21 06:15:26 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 21 Nov 2022 06:15:26 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 02:31:34 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. 
In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > whitespace Looks good to me. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9695 From roland at openjdk.org Mon Nov 21 08:49:59 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 21 Nov 2022 08:49:59 GMT Subject: RFR: 6312651: Compiler should only use verified interface types for optimization [v4] In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 22:35:35 GMT, Vladimir Ivanov wrote: >>> Great work, Roland! I'm approving the PR. (hs-tier1 - hs-tier2 sanity testing passed with latest version.) >>> >>> Feel free to handle `ciArrayKlass::interfaces()` as you find most appropriate. >> >> Thanks for the review (and running tests). >> I removed `ciArrayKlass::interfaces()` in the new commit. Is the resulting code what you suggested? > >> I removed ciArrayKlass::interfaces() in the new commit. Is the resulting code what you suggested? > > Yes, it is much clearer now. Thanks. @iwanowww @vnkozlov thanks for the reviews. ------------- PR: https://git.openjdk.org/jdk/pull/10901 From roland at openjdk.org Mon Nov 21 08:51:29 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 21 Nov 2022 08:51:29 GMT Subject: Integrated: 6312651: Compiler should only use verified interface types for optimization In-Reply-To: References: Message-ID: <7u9ihIZXby4OAqOfy_2SYiNxpbNZPuhrkdWCWVbT4O0=.6873a9cb-23c5-47e9-837c-189c336a490d@github.com> On Fri, 28 Oct 2022 12:29:15 GMT, Roland Westrelin wrote: > This change is mostly the same I sent for review 3 years ago but was > never integrated: > > https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2019-May/033803.html > > The main difference is that, in the meantime, I submitted a couple of > refactoring changes extracted from the 2019 patch: > > 8266550: C2: mirror TypeOopPtr/TypeInstPtr/TypeAryPtr with TypeKlassPtr/TypeInstKlassPtr/TypeAryKlassPtr > 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses > > As a result, the current patch is much smaller (but still not small). > > The implementation is otherwise largely the same as in the 2019 > patch. 
I tried to remove some of the code duplication between the > TypeOopPtr and TypeKlassPtr hierarchies by having some of the logic > shared in template methods. In the 2019 patch, interfaces were trusted > when types were constructed and I had added code to drop interfaces > from a type where they couldn't be trusted. This new patch proceeds > the other way around: interfaces are not trusted when a type is > constructed and code that uses the type must explicitly request that > they are included (this was suggested as an improvement by Vladimir > Ivanov I think). This pull request has now been integrated. Changeset: 45d1807a Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/45d1807ad3248805f32b1b94b02ac368e0d6bcc0 Stats: 1592 lines in 20 files changed: 750 ins; 491 del; 351 mod 6312651: Compiler should only use verified interface types for optimization Reviewed-by: vlivanov, kvn ------------- PR: https://git.openjdk.org/jdk/pull/10901 From bkilambi at openjdk.org Mon Nov 21 09:59:35 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 21 Nov 2022 09:59:35 GMT Subject: Integrated: 8295276: AArch64: Add backend support for half float conversion intrinsics In-Reply-To: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> References: <5AM4Pj8V60JHjfIHgbvE8FGx7BAyy2LmGnUkr3GWNMQ=.d138a971-fe0d-491a-887b-07c96fc03008@github.com> Message-ID: On Thu, 20 Oct 2022 14:33:33 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend support for library intrinsics that implement conversions between half-precision and single-precision floats. > > Ran the following benchmarks to assess the performance with this patch - > > org.openjdk.bench.java.math.Fp16ConversionBenchmark.floatToFloat16 org.openjdk.bench.java.math.Fp16ConversionBenchmark.float16ToFloat > > The performance (ops/ms) gain with the patch on an ARM NEON machine is shown below - > > > Benchmark Gain > Fp16ConversionBenchmark.float16ToFloat 3.42 > Fp16ConversionBenchmark.floatToFloat16 5.85 This pull request has now been integrated. Changeset: 891c706a Author: Bhavana Kilambi Committer: Ningsheng Jian URL: https://git.openjdk.org/jdk/commit/891c706a103042043f5ef6fcf56720ccbcfc7e19 Stats: 658 lines in 4 files changed: 34 ins; 0 del; 624 mod 8295276: AArch64: Add backend support for half float conversion intrinsics Reviewed-by: ngasson, aph, njian ------------- PR: https://git.openjdk.org/jdk/pull/10796 From gcao at openjdk.org Mon Nov 21 10:05:57 2022 From: gcao at openjdk.org (Gui Cao) Date: Mon, 21 Nov 2022 10:05:57 GMT Subject: Integrated: 8297238: RISC-V: C2: Use Matcher::vector_element_basic_type when checking for vector element type in predicate In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 13:12:54 GMT, Gui Cao wrote: > Hi, > > In the vector type predicate matching process of riscv, n->bottom_type()->is_vect()->element_basic_type() is used in some places to get the data type, and Matcher::vector_element_basic_type(n) is used in some places to get the data type, In fact, Matcher::vector_element_basic_type(n) is the function encapsulation form of n->bottom_type()->is_vect()->element_basic_type(), here Matcher::vector_element_basic_type(n) is used uniformly to get the data type. > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) > - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) This pull request has now been integrated. 
Changeset: e4206618 Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/e4206618ac82222f8f61e348cfa68db0d708fe90 Stats: 18 lines in 1 file changed: 0 ins; 0 del; 18 mod 8297238: RISC-V: C2: Use Matcher::vector_element_basic_type when checking for vector element type in predicate Reviewed-by: dzhang, fyang ------------- PR: https://git.openjdk.org/jdk/pull/11239 From yyang at openjdk.org Mon Nov 21 13:17:28 2022 From: yyang at openjdk.org (Yi Yang) Date: Mon, 21 Nov 2022 13:17:28 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v3] In-Reply-To: <4iWrjtgXQfRvRYkT2_wUGAQkIouqwlng4IJmHyvCHqQ=.6b36faaa-eb22-4e32-bfc2-dfedd645eff2@github.com> References: <5ff-r2RgTNzao-sZ4D1kKWOPHWwzaCZxDDxyxl1Y0Us=.ae799d57-29ab-42c5-9908-a5811a8db0bc@github.com> <4iWrjtgXQfRvRYkT2_wUGAQkIouqwlng4IJmHyvCHqQ=.6b36faaa-eb22-4e32-bfc2-dfedd645eff2@github.com> Message-ID: On Tue, 9 Aug 2022 17:16:02 GMT, Vladimir Kozlov wrote: >> Yi Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> update comment > > I still think the issue is in some other place. Your change just avoiding the case which triggers it. The java code is legal and compiling next code did not trigger nay issues: > > static long test(int val, boolean b) { > long ret = 0; > for (int i = 15; 293 > i; ++i) { > ret = val; > val--; > } > return ret; > } > > Actually I was not able to reproduce the issue with included test. @vnkozlov Can I have a second review? Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9695 From coleenp at openjdk.org Mon Nov 21 13:50:02 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 21 Nov 2022 13:50:02 GMT Subject: RFR: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods [v2] In-Reply-To: <0CvVKqEi5uyiVY2hubrmmGe8kVJV-Lc5Zc8bKMmGza4=.c6c70cca-52d9-4723-9992-2ba57473f014@github.com> References: <0CvVKqEi5uyiVY2hubrmmGe8kVJV-Lc5Zc8bKMmGza4=.c6c70cca-52d9-4723-9992-2ba57473f014@github.com> Message-ID: On Fri, 18 Nov 2022 22:40:40 GMT, Coleen Phillimore wrote: >> I fixed the code to also include is_unloading nmethods. I couldn't write a dedicated test for this. >> Tested with tier1-4,6. Also tried to reproduce another redefinition bug with this change, which didn't reproduce, but not caused by this change. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Add a comment. Thanks for reviewing! ------------- PR: https://git.openjdk.org/jdk/pull/11243 From coleenp at openjdk.org Mon Nov 21 13:50:03 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 21 Nov 2022 13:50:03 GMT Subject: Integrated: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 16:51:45 GMT, Coleen Phillimore wrote: > I fixed the code to also include is_unloading nmethods. I couldn't write a dedicated test for this. > Tested with tier1-4,6. Also tried to reproduce another redefinition bug with this change, which didn't reproduce, but not caused by this change. This pull request has now been integrated. 
Changeset: 08008139 Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/08008139cc05a8271e7163eca47d2bc59db2049b Stats: 5 lines in 1 file changed: 0 ins; 1 del; 4 mod 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods Reviewed-by: eosterlund, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11243 From rrich at openjdk.org Mon Nov 21 13:59:54 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 21 Nov 2022 13:59:54 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: References: Message-ID: > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. > > The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. > > > X86 / AARCH64 PPC64: > > : : : : > : : : : > | | | | > |-----------------| |-----------------| > | | | | > | stack arguments | | stack arguments | > | |<- callers_SP | | > =================== |-----------------| > | | | | > | metadata at bottom | | metadata at top | > | | | |<- callers_SP > |-----------------| =================== > | | | | > | | | | > | | | | > | | | | > | |<- SP | | > =================== |-----------------| > | | > | metadata at top | > | |<- SP > =================== > > > On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. > > * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: > `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` > > * address of stack arguments: > `callers_SP + frame::metadata_words_at_top` > > * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. > > Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. > > The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. 
Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. > > Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. > > Thanks, Richard. Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Cleanup BasicExp test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10961/files - new: https://git.openjdk.org/jdk/pull/10961/files/116839ee..22430750 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=05-06 Stats: 23 lines in 1 file changed: 0 ins; 6 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From duke at openjdk.org Mon Nov 21 17:44:36 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Mon, 21 Nov 2022 17:44:36 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v21] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 19:32:28 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> vzeroall, no spill, reg re-map > > Overall, looks good. Just one minor cleanup suggestion. > > I've submitted the latest patch for testing (hs-tier1 - hs-tier4). @iwanowww Hope the extra tests passed? (Or do you have to re-run them on the latest patch again?) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From mdoerr at openjdk.org Mon Nov 21 18:31:25 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 21 Nov 2022 18:31:25 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: References: Message-ID: <8Wr20y50FkFBSZhYpi1UzRm5sl9ZcHNGs7ljbeTLQe8=.7982fc02-f681-4cc2-a565-9431df59c4e9@github.com> On Mon, 21 Nov 2022 13:59:54 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). >> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. 
>> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Cleanup BasicExp test Thanks for mitigating the risk of test timeouts! ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.org/jdk/pull/10961 From vlivanov at openjdk.org Mon Nov 21 19:00:31 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 21 Nov 2022 19:00:31 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v22] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 20:42:27 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. 
For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > remove early return JVM part looks good. The test results look good. (Had to wait until testing is complete.) ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/10582 From vkempik at openjdk.org Mon Nov 21 20:56:30 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Mon, 21 Nov 2022 20:56:30 GMT Subject: RFR: 8297359: RISC-V: improve performance of floating Max Min intrinsics Message-ID: Placeholder ------------- Commit messages: - remove unneeded space addition - Merge branch 'master' into minmax_fadd - minmax with fadd and fclass Changes: https://git.openjdk.org/jdk/pull/11276/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11276&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297359 Stats: 19 lines in 1 file changed: 7 ins; 11 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11276.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11276/head:pull/11276 PR: https://git.openjdk.org/jdk/pull/11276 From duke at openjdk.org Mon Nov 21 21:05:39 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Mon, 21 Nov 2022 21:05:39 GMT Subject: Integrated: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 21:28:26 GMT, Volodymyr Paprotski wrote: > Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. > > - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. > - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. > - Added a JMH perf test. > - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. 
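For context, the hot loop that the AVX512 code vectorizes is, per RFC 8439, acc = (acc + block) * r mod 2^130-5 over 16-byte message blocks. A minimal scalar sketch in Java (BigInteger-based and illustrative only; the class and method names are made up and this is not the actual Poly1305.java code):

    import java.math.BigInteger;

    public class Poly1305Sketch {
        private static final BigInteger P =
                BigInteger.TWO.pow(130).subtract(BigInteger.valueOf(5));

        // r and s are the two (clamped) 16-byte halves of the one-time key.
        static byte[] tag(byte[] msg, BigInteger r, BigInteger s) {
            BigInteger acc = BigInteger.ZERO;
            for (int off = 0; off < msg.length; off += 16) {
                int len = Math.min(16, msg.length - off);
                byte[] be = new byte[len + 1];
                be[0] = 1;                      // append the 2^(8*len) bit
                for (int i = 0; i < len; i++) {
                    be[len - i] = msg[off + i]; // little-endian block -> big-endian bytes
                }
                BigInteger block = new BigInteger(1, be);
                acc = acc.add(block).multiply(r).mod(P); // acc = (acc + block) * r mod p
            }
            BigInteger t = acc.add(s).mod(BigInteger.TWO.pow(128));
            byte[] out = new byte[16];          // the tag is the low 128 bits, little-endian
            byte[] tb = t.toByteArray();
            for (int i = 0; i < 16 && i < tb.length; i++) {
                out[i] = tb[tb.length - 1 - i];
            }
            return out;
        }
    }

Since every block update multiplies by the same r, a vectorized implementation can evaluate a batch of blocks per iteration (the 16 blocks mentioned above) using precomputed powers of r; the comments in macroAssembler_x86_poly.cpp describe the actual scheme used, and that batching is why throughput grows with message size in the numbers quoted below.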
> > Perf before: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s > > and after: > > Benchmark (dataSize) (provider) Mode Cnt Score Error Units > Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s > Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s > Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s > Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s > Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s This pull request has now been integrated. Changeset: f12710e9 Author: Volodymyr Paprotski Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/f12710e938b36594623e9c82961d8aa0c0ef29c2 Stats: 1860 lines in 32 files changed: 1824 ins; 3 del; 33 mod 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions Reviewed-by: sviswanathan, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/10582 From rrich at openjdk.org Mon Nov 21 22:45:24 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 21 Nov 2022 22:45:24 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: <8Wr20y50FkFBSZhYpi1UzRm5sl9ZcHNGs7ljbeTLQe8=.7982fc02-f681-4cc2-a565-9431df59c4e9@github.com> References: <8Wr20y50FkFBSZhYpi1UzRm5sl9ZcHNGs7ljbeTLQe8=.7982fc02-f681-4cc2-a565-9431df59c4e9@github.com> Message-ID: On Mon, 21 Nov 2022 18:27:25 GMT, Martin Doerr wrote: > Thanks for mitigating the risk of test timeouts! Yes I had to adjust the timeout a little. Also it's not necessary to set -XX:+VerifyContinuations again in the runs that call System.gc(). ------------- PR: https://git.openjdk.org/jdk/pull/10961 From dholmes at openjdk.org Tue Nov 22 00:45:57 2022 From: dholmes at openjdk.org (David Holmes) Date: Tue, 22 Nov 2022 00:45:57 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v22] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 20:42:27 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 
154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > remove early return Testing is broken: test/jdk/sun/security/util/math/BigIntegerModuloP.java:160: error: BigIntegerModuloP.ImmutableElement is not abstract and does not override abstract method getLimbs() in IntegerModuloP private class ImmutableElement extends Element Did you forget to commit a test file? I will file a new bug for this. ------------- PR: https://git.openjdk.org/jdk/pull/10582 From lmesnik at openjdk.org Tue Nov 22 02:51:28 2022 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Tue, 22 Nov 2022 02:51:28 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 13:59:54 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). >> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. >> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. 
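To make the two quoted formulas concrete, here is a small illustrative sketch (in Java for brevity; this is not the actual HotSpot code, and the helper names are made up -- only the constant and the arithmetic follow the description above):

    public class FreezeSizeSketch {
        // Stand-in for frame::metadata_words_at_top: 0 words on X86/AARCH64, 4 on PPC64.
        static final long METADATA_WORDS_AT_TOP = 4; // PPC64 value from the description

        // Words needed to freeze one frame plus its stack arguments into a StackChunk.
        static long freezeSizeInWords(long frameWords, long stackArgWords) {
            return frameWords + stackArgWords + METADATA_WORDS_AT_TOP;
        }

        // Word address of the callee's stack arguments, given the caller's SP (in words).
        static long stackArgsAddress(long callersSp) {
            return callersSp + METADATA_WORDS_AT_TOP;
        }
    }

On X86/AARCH64 the constant is zero, so both expressions collapse to the existing calculations; on PPC64 they account for the metadata stored at the frame top.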
>> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as an argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know whether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as I found it very useful; it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Cleanup BasicExp test
Not a complete review, but some notes about test style.
test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 129:
> 127: * @summary Collection of basic continuation tests. CompilationPolicy controls which frames in a sequence should be compiled when calling Continuation.yield(). > 128: * @requires vm.continuations > 129: * @requires vm.flavor == "server" & (vm.opt.TieredCompilation == null | vm.opt.TieredCompilation == false)
Isn't vm.opt.TieredCompilation != true better?
test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 136:
> 134: * @run driver jdk.test.lib.helpers.ClassFileInstaller jdk.test.whitebox.WhiteBox > 135: * > 136: * @run main/othervm/timeout=300 --enable-preview -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xbootclasspath/a:.
It is better to use the @enablePreview tag so --enable-preview is not needed. Also, please remove -XX:+UnlockDiagnosticVMOptions where it is not required.
test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 193:
> 191: compLevel = CompilerWhiteBoxTest.COMP_LEVEL_FULL_OPTIMIZATION; > 192: // // Run tests with C1 compilations > 193: // compLevel = CompilerWhiteBoxTest.COMP_LEVEL_FULL_PROFILE;
Is there any reason why the test is executed with C2 only? And is -XX:-TieredCompilation required if the optimization level and compilation are controlled with WB?
test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 224:
> 222: compPolicy.print(); System.out.println(); > 223: > 224: new ContinuationRunYieldRunTest().runTestCase(3, compPolicy);
I think it would be better to split the test into several smaller test cases so timeout=300 is not needed. The finer granularity allows better parallelization of execution, and a 5-minute timeout implies a very long-running test that cannot be split.
test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 369:
> 367: } > 368: > 369: public void log_dontjit() {
Please use camelCase for all functions.
test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 484:
> 482: log_dontjit("Exc: " + e); > 483: } > 484: if (callSystemGC) System.gc();
It would be better to call WB.fullGC(), not System.gc(), to ensure that a GC actually happens.
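For reference, the WhiteBox API the comment refers to is the one already installed by the test's @run driver line (jdk.test.whitebox.WhiteBox). A minimal usage sketch (illustrative only, not the test's actual code):

    import jdk.test.whitebox.WhiteBox;

    class GcSketch {
        private static final WhiteBox WB = WhiteBox.getWhiteBox();

        static void forceGC() {
            // System.gc() is only a hint to the VM; WhiteBox.fullGC() reliably triggers
            // a full collection, which is why the reviewer suggests it when the test
            // really needs a GC to happen at that point.
            WB.fullGC();
        }
    }

This relies on the usual WhiteBox setup (-Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI), which the test's @run line already passes.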
test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 504:
> 502: } > 503: > 504: static final long i1=1; static final long i2=2; static final long i3=3;
Please fix the indentation.
test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 729:
> 727: log_dontjit("Continuation running on thread " + Thread.currentThread()); > 728: long res = ord101_recurse_dontinline(0, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11); > 729: if (res != i1+i2+i3+i4+i5+i6+i7+i8+i9+i10+i11) {
Please fix the indentation. There are several places where 'i1+i2' should be fixed to 'i1 + i2'.
------------- Changes requested by lmesnik (Reviewer). PR: https://git.openjdk.org/jdk/pull/10961
From duke at openjdk.org Tue Nov 22 05:11:15 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Tue, 22 Nov 2022 05:11:15 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes Message-ID: I noticed that some idealizations have no associated IR tests, so I included tests for them.
------------- Commit messages: - format. - format whitespace. - add some missing tests for existing idealizations. Changes: https://git.openjdk.org/jdk/pull/11049/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11049&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297384 Stats: 582 lines in 11 files changed: 505 ins; 0 del; 77 mod Patch: https://git.openjdk.org/jdk/pull/11049.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11049/head:pull/11049 PR: https://git.openjdk.org/jdk/pull/11049
From yyang at openjdk.org Tue Nov 22 05:11:16 2022 From: yyang at openjdk.org (Yi Yang) Date: Tue, 22 Nov 2022 05:11:16 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 00:20:37 GMT, Zhiqiang Zang wrote: > I noticed that some idealizations have no associated IR tests, so I included tests for them. Hi, do you need a JBS issue for this? Okay, I filed [JDK-8297384 Add IR tests for existing idealizations of arithmetic nodes](https://bugs.openjdk.org/browse/JDK-8297384) for this patch.
------------- PR: https://git.openjdk.org/jdk/pull/11049
From duke at openjdk.org Tue Nov 22 05:11:16 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Tue, 22 Nov 2022 05:11:16 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 15:02:04 GMT, Yi Yang wrote: > Hi, do you need a JBS issue for this? Hi, I do not have an OpenJDK account, so could you open one for me? Thanks!
------------- PR: https://git.openjdk.org/jdk/pull/11049
From duke at openjdk.org Tue Nov 22 05:18:51 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Tue, 22 Nov 2022 05:18:51 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes [v2] In-Reply-To: References: Message-ID: > I noticed that some idealizations have no associated IR tests, so I included tests for them. Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: include @bug and @summary for newly created ir test classes.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/11049/files - new: https://git.openjdk.org/jdk/pull/11049/files/6930c908..c0931397 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11049&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11049&range=00-01 Stats: 12 lines in 6 files changed: 12 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11049.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11049/head:pull/11049 PR: https://git.openjdk.org/jdk/pull/11049 From duke at openjdk.org Tue Nov 22 05:28:03 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Tue, 22 Nov 2022 05:28:03 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 02:23:01 GMT, Yi Yang wrote: > Okay, I filed [JDK-8297384 Add IR tests for existing idealizations of arithmetic nodes](https://bugs.openjdk.org/browse/JDK-8297384) for this patch. Thanks a lot! I included the issue id in this PR. Could you please help review the PR if you get a chance, thanks! ------------- PR: https://git.openjdk.org/jdk/pull/11049 From vkempik at openjdk.org Tue Nov 22 08:31:24 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Tue, 22 Nov 2022 08:31:24 GMT Subject: RFR: 8297359: RISC-V: improve performance of floating Max Min intrinsics In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 20:48:00 GMT, Vladimir Kempik wrote: > Please review this change. > > It improves performance of Math.min/max intrinsics for Floats and Doubles. > > The main issue in these intrinsics is the requirement to return NaN if any of arguments is NaN. In risc-v, fmin/fmax returns NaN only if both of src registers are NaN ( quiet NaN). > That requires additional logic to handle the case where only of of src is NaN. > > Here the postcheck with flt (floating less than comparision) and flags analysis replaced with precheck. The precheck is done with fadd-ing srcs into dst and checking the dst for NaN ( with fclass). > > The results on the thead c910: > > The results, thead c910: > > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 54023.827 ? 268.645 ns/op > FpMinMaxIntrinsics.dMin avgt 25 54309.850 ? 323.551 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 42192.140 ? 12.114 ns/op > FpMinMaxIntrinsics.fMax avgt 25 53797.657 ? 15.816 ns/op > FpMinMaxIntrinsics.fMin avgt 25 54135.710 ? 313.185 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 42196.156 ? 13.424 ns/op > MaxMinOptimizeTest.dAdd avgt 25 650.810 ? 169.998 us/op > MaxMinOptimizeTest.dMax avgt 25 4561.967 ? 40.367 us/op > MaxMinOptimizeTest.dMin avgt 25 4589.100 ? 75.854 us/op > MaxMinOptimizeTest.dMul avgt 25 759.821 ? 240.092 us/op > MaxMinOptimizeTest.fAdd avgt 25 300.137 ? 13.495 us/op > MaxMinOptimizeTest.fMax avgt 25 4348.885 ? 20.061 us/op > MaxMinOptimizeTest.fMin avgt 25 4372.799 ? 27.296 us/op > MaxMinOptimizeTest.fMul avgt 25 304.024 ? 12.120 us/op > > after > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 10545.196 ? 140.137 ns/op > FpMinMaxIntrinsics.dMin avgt 25 10454.525 ? 9.972 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 3104.703 ? 0.892 ns/op > FpMinMaxIntrinsics.fMax avgt 25 10449.709 ? 7.284 ns/op > FpMinMaxIntrinsics.fMin avgt 25 10445.261 ? 7.206 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 3104.769 ? 0.951 ns/op > MaxMinOptimizeTest.dAdd avgt 25 487.769 ? 170.711 us/op > MaxMinOptimizeTest.dMax avgt 25 929.394 ? 158.697 us/op > MaxMinOptimizeTest.dMin avgt 25 864.230 ? 
284.794 us/op > MaxMinOptimizeTest.dMul avgt 25 894.116 ? 342.550 us/op > MaxMinOptimizeTest.fAdd avgt 25 284.664 ? 1.446 us/op > MaxMinOptimizeTest.fMax avgt 25 384.388 ? 15.004 us/op > MaxMinOptimizeTest.fMin avgt 25 371.952 ? 15.295 us/op > MaxMinOptimizeTest.fMul avgt 25 305.226 ? 12.467 us/op > > significant improvement > > On hifive u74 ( unmatched) the improvements is less significant: > > hifive: > > before > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 30219.666 ? 12.878 ns/op > FpMinMaxIntrinsics.dMin avgt 25 30242.249 ? 31.374 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 15394.622 ? 2.803 ns/op > FpMinMaxIntrinsics.fMax avgt 25 30150.114 ? 22.421 ns/op > FpMinMaxIntrinsics.fMin avgt 25 30149.752 ? 20.813 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 15396.402 ? 4.251 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1143.582 ? 4.444 us/op > MaxMinOptimizeTest.dMax avgt 25 2556.317 ? 3.795 us/op > MaxMinOptimizeTest.dMin avgt 25 2556.569 ? 2.274 us/op > MaxMinOptimizeTest.dMul avgt 25 1142.769 ? 1.593 us/op > MaxMinOptimizeTest.fAdd avgt 25 748.688 ? 7.342 us/op > MaxMinOptimizeTest.fMax avgt 25 2280.381 ? 1.535 us/op > MaxMinOptimizeTest.fMin avgt 25 2280.760 ? 1.532 us/op > MaxMinOptimizeTest.fMul avgt 25 748.991 ? 7.261 us/op > > after: > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 27723.791 ? 22.784 ns/op > FpMinMaxIntrinsics.dMin avgt 25 27760.799 ? 45.411 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 12875.949 ? 2.829 ns/op > FpMinMaxIntrinsics.fMax avgt 25 25992.753 ? 23.788 ns/op > FpMinMaxIntrinsics.fMin avgt 25 25994.554 ? 32.060 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 11200.737 ? 2.169 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1144.128 ? 4.371 us/op > MaxMinOptimizeTest.dMax avgt 25 1968.145 ? 2.346 us/op > MaxMinOptimizeTest.dMin avgt 25 1970.249 ? 4.712 us/op > MaxMinOptimizeTest.dMul avgt 25 1143.356 ? 2.203 us/op > MaxMinOptimizeTest.fAdd avgt 25 748.634 ? 7.229 us/op > MaxMinOptimizeTest.fMax avgt 25 1523.719 ? 0.570 us/op > MaxMinOptimizeTest.fMin avgt 25 1524.534 ? 1.109 us/op > MaxMinOptimizeTest.fMul avgt 25 748.643 ? 
7.291 us/op > > > fAdd/dAdd and fMul/dMull is unaffected likely due to : > > private double dAddBench(double a, double b) { > return Math.max(a, b) + Math.min(a, b); > } > > private double dMulBench(double a, double b) { > return Math.max(a, b) * Math.min(a, b); > } > may get reduces to just a + b and a*b respectively without actually using min/max > > Testing : tier1/tier2 in progress, will update this as soon as it finishes Withdrawn, this version have issues when operating with infinity, I'll redo the change test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i * 5]): success test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i + 1]): success test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN] at org.testng.Assert.fail(Assert.java:99) -- test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure java.lang.AssertionError: at index #10 expected [Infinity] but found [NaN] at org.testng.Assert.fail(Assert.java:99) -- test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN] at org.testng.Assert.fail(Assert.java:99) -- test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i * 5]): success test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i + 1]): success test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN] at org.testng.Assert.fail(Assert.java:99) -- test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN] at org.testng.Assert.fail(Assert.java:99) -- test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN] at org.testng.Assert.fail(Assert.java:99) ------------- PR: https://git.openjdk.org/jdk/pull/11276 From vkempik at openjdk.org Tue Nov 22 08:31:25 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Tue, 22 Nov 2022 08:31:25 GMT Subject: Withdrawn: 8297359: RISC-V: improve performance of floating Max Min intrinsics In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 20:48:00 GMT, Vladimir Kempik wrote: > Please review this change. > > It improves performance of Math.min/max intrinsics for Floats and Doubles. > > The main issue in these intrinsics is the requirement to return NaN if any of arguments is NaN. 
In risc-v, fmin/fmax returns NaN only if both of src registers are NaN ( quiet NaN). > That requires additional logic to handle the case where only of of src is NaN. > > Here the postcheck with flt (floating less than comparision) and flags analysis replaced with precheck. The precheck is done with fadd-ing srcs into dst and checking the dst for NaN ( with fclass). > > The results on the thead c910: > > The results, thead c910: > > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 54023.827 ? 268.645 ns/op > FpMinMaxIntrinsics.dMin avgt 25 54309.850 ? 323.551 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 42192.140 ? 12.114 ns/op > FpMinMaxIntrinsics.fMax avgt 25 53797.657 ? 15.816 ns/op > FpMinMaxIntrinsics.fMin avgt 25 54135.710 ? 313.185 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 42196.156 ? 13.424 ns/op > MaxMinOptimizeTest.dAdd avgt 25 650.810 ? 169.998 us/op > MaxMinOptimizeTest.dMax avgt 25 4561.967 ? 40.367 us/op > MaxMinOptimizeTest.dMin avgt 25 4589.100 ? 75.854 us/op > MaxMinOptimizeTest.dMul avgt 25 759.821 ? 240.092 us/op > MaxMinOptimizeTest.fAdd avgt 25 300.137 ? 13.495 us/op > MaxMinOptimizeTest.fMax avgt 25 4348.885 ? 20.061 us/op > MaxMinOptimizeTest.fMin avgt 25 4372.799 ? 27.296 us/op > MaxMinOptimizeTest.fMul avgt 25 304.024 ? 12.120 us/op > > after > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 10545.196 ? 140.137 ns/op > FpMinMaxIntrinsics.dMin avgt 25 10454.525 ? 9.972 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 3104.703 ? 0.892 ns/op > FpMinMaxIntrinsics.fMax avgt 25 10449.709 ? 7.284 ns/op > FpMinMaxIntrinsics.fMin avgt 25 10445.261 ? 7.206 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 3104.769 ? 0.951 ns/op > MaxMinOptimizeTest.dAdd avgt 25 487.769 ? 170.711 us/op > MaxMinOptimizeTest.dMax avgt 25 929.394 ? 158.697 us/op > MaxMinOptimizeTest.dMin avgt 25 864.230 ? 284.794 us/op > MaxMinOptimizeTest.dMul avgt 25 894.116 ? 342.550 us/op > MaxMinOptimizeTest.fAdd avgt 25 284.664 ? 1.446 us/op > MaxMinOptimizeTest.fMax avgt 25 384.388 ? 15.004 us/op > MaxMinOptimizeTest.fMin avgt 25 371.952 ? 15.295 us/op > MaxMinOptimizeTest.fMul avgt 25 305.226 ? 12.467 us/op > > significant improvement > > On hifive u74 ( unmatched) the improvements is less significant: > > hifive: > > before > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 30219.666 ? 12.878 ns/op > FpMinMaxIntrinsics.dMin avgt 25 30242.249 ? 31.374 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 15394.622 ? 2.803 ns/op > FpMinMaxIntrinsics.fMax avgt 25 30150.114 ? 22.421 ns/op > FpMinMaxIntrinsics.fMin avgt 25 30149.752 ? 20.813 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 15396.402 ? 4.251 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1143.582 ? 4.444 us/op > MaxMinOptimizeTest.dMax avgt 25 2556.317 ? 3.795 us/op > MaxMinOptimizeTest.dMin avgt 25 2556.569 ? 2.274 us/op > MaxMinOptimizeTest.dMul avgt 25 1142.769 ? 1.593 us/op > MaxMinOptimizeTest.fAdd avgt 25 748.688 ? 7.342 us/op > MaxMinOptimizeTest.fMax avgt 25 2280.381 ? 1.535 us/op > MaxMinOptimizeTest.fMin avgt 25 2280.760 ? 1.532 us/op > MaxMinOptimizeTest.fMul avgt 25 748.991 ? 7.261 us/op > > after: > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 27723.791 ? 22.784 ns/op > FpMinMaxIntrinsics.dMin avgt 25 27760.799 ? 45.411 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 12875.949 ? 2.829 ns/op > FpMinMaxIntrinsics.fMax avgt 25 25992.753 ? 23.788 ns/op > FpMinMaxIntrinsics.fMin avgt 25 25994.554 ? 
32.060 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 11200.737 ? 2.169 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1144.128 ? 4.371 us/op > MaxMinOptimizeTest.dMax avgt 25 1968.145 ? 2.346 us/op > MaxMinOptimizeTest.dMin avgt 25 1970.249 ? 4.712 us/op > MaxMinOptimizeTest.dMul avgt 25 1143.356 ? 2.203 us/op > MaxMinOptimizeTest.fAdd avgt 25 748.634 ? 7.229 us/op > MaxMinOptimizeTest.fMax avgt 25 1523.719 ? 0.570 us/op > MaxMinOptimizeTest.fMin avgt 25 1524.534 ? 1.109 us/op > MaxMinOptimizeTest.fMul avgt 25 748.643 ? 7.291 us/op > > > fAdd/dAdd and fMul/dMull is unaffected likely due to : > > private double dAddBench(double a, double b) { > return Math.max(a, b) + Math.min(a, b); > } > > private double dMulBench(double a, double b) { > return Math.max(a, b) * Math.min(a, b); > } > may get reduces to just a + b and a*b respectively without actually using min/max > > Testing : tier1/tier2 in progress, will update this as soon as it finishes This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/11276 From thartmann at openjdk.org Tue Nov 22 08:37:58 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Nov 2022 08:37:58 GMT Subject: RFR: 8297382: Test fails to compile after JDK-8288047 Message-ID: [JDK-8288047](https://bugs.openjdk.org/browse/JDK-8288047) added `long[] getLimbs()` to `src/java.base/share/classes/sun/security/util/math/IntegerModuloP.java` but forgot to override that method in test implementation `test/jdk/sun/security/util/math/BigIntegerModuloP.java`. Thanks, Tobias ------------- Commit messages: - 8297382: Test fails to compile after JDK-8288047 Changes: https://git.openjdk.org/jdk/pull/11282/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11282&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297382 Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11282.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11282/head:pull/11282 PR: https://git.openjdk.org/jdk/pull/11282 From chagedorn at openjdk.org Tue Nov 22 09:24:20 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 22 Nov 2022 09:24:20 GMT Subject: RFR: 8297382: Test fails to compile after JDK-8288047 In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 08:28:44 GMT, Tobias Hartmann wrote: > [JDK-8288047](https://bugs.openjdk.org/browse/JDK-8288047) added `long[] getLimbs()` to `src/java.base/share/classes/sun/security/util/math/IntegerModuloP.java` but forgot to override that method in test implementation `test/jdk/sun/security/util/math/BigIntegerModuloP.java`. > > Thanks, > Tobias Looks good and trivial! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11282 From thartmann at openjdk.org Tue Nov 22 09:28:08 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Nov 2022 09:28:08 GMT Subject: RFR: 8297382: Test fails to compile after JDK-8288047 In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 08:28:44 GMT, Tobias Hartmann wrote: > [JDK-8288047](https://bugs.openjdk.org/browse/JDK-8288047) added `long[] getLimbs()` to `src/java.base/share/classes/sun/security/util/math/IntegerModuloP.java` but forgot to override that method in test implementation `test/jdk/sun/security/util/math/BigIntegerModuloP.java`. > > Thanks, > Tobias Thanks for the review, Christian! 
------------- PR: https://git.openjdk.org/jdk/pull/11282 From thartmann at openjdk.org Tue Nov 22 09:29:42 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Nov 2022 09:29:42 GMT Subject: Integrated: 8297382: Test fails to compile after JDK-8288047 In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 08:28:44 GMT, Tobias Hartmann wrote: > [JDK-8288047](https://bugs.openjdk.org/browse/JDK-8288047) added `long[] getLimbs()` to `src/java.base/share/classes/sun/security/util/math/IntegerModuloP.java` but forgot to override that method in test implementation `test/jdk/sun/security/util/math/BigIntegerModuloP.java`. > > Thanks, > Tobias This pull request has now been integrated. Changeset: 42c20374 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/42c2037429a8ee6f683bbbc99fb48c540519524c Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod 8297382: Test fails to compile after JDK-8288047 Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/11282 From chagedorn at openjdk.org Tue Nov 22 09:49:21 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 22 Nov 2022 09:49:21 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes [v2] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 05:18:51 GMT, Zhiqiang Zang wrote: >> I noticed some idealizations have no associated IR tests so I included for them. > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > include @bug and @summary for newly created ir test classes. Nice additional tests! They look good. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11049 From dnsimon at openjdk.org Tue Nov 22 14:38:25 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 22 Nov 2022 14:38:25 GMT Subject: RFR: 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception Message-ID: JVMCI has a mechanism for translating exceptions from libjvmci to HotSpot and vice versa. This is important for proper error handling when a thread calls between these 2 runtime heaps. This translation mechanism itself needs to be robust in the context of resource limits, especially heap limits, as it may be translating an OutOfMemoryError from HotSpot back into libjvmci. The existing code in [`HotSpotJVMCIRuntime.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java#L237) and [`TranslatedException.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/TranslatedException.java#L153) is designed to handle translation failures by falling back to non-allocating code. However, we still occasionally see [an OOME that breaks the translation mechanism](https://github.com/oracle/graal/issues/5470#issuecomment-1321749688). One speculated possibility for this is an OOME re-materializing oops duri ng a deoptimization causing an unexpected execution path. This PR increases the robustness of the exception translation code in light of such issues. 
------------- Commit messages: - be more robust translating exceptions between HotSpot and libjvmci Changes: https://git.openjdk.org/jdk/pull/11286/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11286&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297431 Stats: 25 lines in 3 files changed: 16 ins; 2 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/11286.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11286/head:pull/11286 PR: https://git.openjdk.org/jdk/pull/11286 From thartmann at openjdk.org Tue Nov 22 15:25:42 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Nov 2022 15:25:42 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v21] In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 17:42:28 GMT, Volodymyr Paprotski wrote: >> Overall, looks good. Just one minor cleanup suggestion. >> >> I've submitted the latest patch for testing (hs-tier1 - hs-tier4). > > @iwanowww Hope the extra tests passed? (Or do you have to re-run them on the latest patch again?) I fixed the test issue with [JDK-8297382](https://bugs.openjdk.org/browse/JDK-8297382) but this also caused a regression with one of the crypto tests: [JDK-8297417](https://bugs.openjdk.org/browse/JDK-8297417). @vpaprotsk, @sviswa7 could you please have a look at this? ------------- PR: https://git.openjdk.org/jdk/pull/10582 From duke at openjdk.org Tue Nov 22 15:30:45 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 22 Nov 2022 15:30:45 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v21] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 15:21:44 GMT, Tobias Hartmann wrote: >> @iwanowww Hope the extra tests passed? (Or do you have to re-run them on the latest patch again?) > > I fixed the test issue with [JDK-8297382](https://bugs.openjdk.org/browse/JDK-8297382) but this also caused a regression with one of the crypto tests: [JDK-8297417](https://bugs.openjdk.org/browse/JDK-8297417). @vpaprotsk, @sviswa7 could you please have a look at this? @TobiHartmann @dholmes-ora Sorry about that, looking ------------- PR: https://git.openjdk.org/jdk/pull/10582 From never at openjdk.org Tue Nov 22 16:32:20 2022 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 22 Nov 2022 16:32:20 GMT Subject: RFR: 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception In-Reply-To: References: Message-ID: <2NRXrCm_YYPDtujf239wUNPW1iKv9ZyVDVkXmlLK0dA=.64e06762-03e7-43a5-b57f-77a24a99d042@github.com> On Tue, 22 Nov 2022 14:30:01 GMT, Doug Simon wrote: > JVMCI has a mechanism for translating exceptions from libjvmci to HotSpot and vice versa. This is important for proper error handling when a thread calls between these 2 runtime heaps. > > This translation mechanism itself needs to be robust in the context of resource limits, especially heap limits, as it may be translating an OutOfMemoryError from HotSpot back into libjvmci. The existing code in [`HotSpotJVMCIRuntime.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java#L237) and [`TranslatedException.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/TranslatedException.java#L153) is designed to handle translation failures by falling back to non-allocating code. 
However, we still occasionally see [an OOME that breaks the translation mechanism](https://github.com/oracle/graal/issues/5470#issuecomment-1321749688). One speculated possibility for this is an OOME re-materializing oops du ring a deoptimization causing an unexpected execution path. This PR increases the robustness of the exception translation code in light of such issues. src/hotspot/share/jvmci/jvmciEnv.cpp line 321: > 319: jlong buffer = (jlong) NEW_RESOURCE_ARRAY_IN_THREAD_RETURN_NULL(THREAD, jbyte, buffer_size); > 320: if (buffer == 0L) { > 321: decode(THREAD, runtimeKlass, 0L); Can we add an argument so that each of these call sites reports a unique message? Should we get the class name from the pending exception and include that as well? I think we should include enough breadcrumbs for future failures in the this path that we might have a better guess what's happening. ------------- PR: https://git.openjdk.org/jdk/pull/11286 From kvn at openjdk.org Tue Nov 22 17:07:29 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 22 Nov 2022 17:07:29 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 02:31:34 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > whitespace I may ask to do our internal performance testing for this change too before approval. ------------- PR: https://git.openjdk.org/jdk/pull/9695 From rkennke at openjdk.org Tue Nov 22 17:09:02 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 22 Nov 2022 17:09:02 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism Message-ID: Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). 
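To make the generalization described in this RFR (8297036) easier to picture, here is a rough sketch of the kind of shared abstraction it is about, continued by the requirements listed next: one base class that owns the entry/continuation labels and a virtual emit(), plus a list the backend walks when emitting all out-of-line code. The names follow the wording of the description but are illustrative only, not necessarily the classes the patch adds.

// Illustrative sketch, not the patch itself (allocation base class omitted).
class C2CodeStub {
  Label _entry;          // in-line code branches here to enter the stub
  Label _continuation;   // the stub branches back here when done
public:
  Label& entry()        { return _entry; }
  Label& continuation() { return _continuation; }
  virtual int  max_size() const = 0;               // worst-case emitted size
  virtual void emit(C2_MacroAssembler& masm) = 0;  // platform-specific body
};

class C2CodeStubList {
  GrowableArray<C2CodeStub*> _stubs;
public:
  void add_stub(C2CodeStub* stub) { _stubs.append(stub); }
  void emit(C2_MacroAssembler& masm) {
    for (int i = 0; i < _stubs.length(); i++) {
      _stubs.at(i)->emit(masm);   // safepoint polls, entry barriers, and so on
    }
  }
};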
I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. Testing: - [x] tier1 (x86_64, x86_32, aarch64) - [x] tier2 (x86_64, x86_32, aarch64) - [x] tier3 (x86_64, x86_32, aarch64) ------------- Commit messages: - RISCV fixes - Rename new platform files - Revert "Add virtual destructor to C2CodeStub" - Add virtual destructor to C2CodeStub - Some fixes (RISCV) - Add missing include (PPC) - RISCV parts - PPC parts - Use compile arena to allocate stub list - Rename files in includes, too. Duh. - ... and 11 more: https://git.openjdk.org/jdk/compare/87530e66...30a22232 Changes: https://git.openjdk.org/jdk/pull/11188/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297036 Stats: 927 lines in 21 files changed: 458 ins; 450 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From rkennke at openjdk.org Tue Nov 22 17:09:03 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 22 Nov 2022 17:09:03 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 15:03:07 GMT, Roman Kennke wrote: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) @RealFYang, @tstuefe could you check what might be wrong with this PR? RISCV and PPC builds complain with: > > > undefined reference to `vtable for C2SafepointPollStub' > > > > > > I don't think the newly added [c2_CodeStubs.cpp](https://github.com/openjdk/jdk/pull/11188/files#diff-290de220f9ef00679e975edd8e14fca00873e5d4acc734c81f4893ebaaa1484c) even compiles, because this should fail: > > ``` > > __ assert_alignment(pc()); > > ``` > > > > > > > > > > > > > > > > > > > > > > > > You probably need to call it `c2_CodeStubs_riscv.cpp`? This might be some weird build system quirk? > > Ah. "Of course." You have `share/opto/c2_CodeStubs.cpp` and `cpu/riscv/c2_CodeStubs.cpp`. Only one gets compiled. Don't clash the compilation unit names :) Indeed. Duh. Thanks for finding it, I've been scratching my head why the linker complains on PPC and RISCV. ------------- PR: https://git.openjdk.org/jdk/pull/11188 From shade at openjdk.org Tue Nov 22 17:09:04 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 22 Nov 2022 17:09:04 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 15:03:07 GMT, Roman Kennke wrote: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). 
I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) > undefined reference to `vtable for C2SafepointPollStub' I don't think the newly added [c2_CodeStubs.cpp](https://github.com/openjdk/jdk/pull/11188/files#diff-290de220f9ef00679e975edd8e14fca00873e5d4acc734c81f4893ebaaa1484c) even compiles, because this should fail: __ assert_alignment(pc()); You probably need to call it `c2_CodeStubs_riscv.cpp`? This might be some weird build system quirk? Further RISC-V build fixes: diff --git a/src/hotspot/cpu/riscv/c2_CodeStubs_riscv.cpp b/src/hotspot/cpu/riscv/c2_CodeStubs_riscv.cpp index bea3f39da31..df38bb5e4d0 100644 --- a/src/hotspot/cpu/riscv/c2_CodeStubs_riscv.cpp +++ b/src/hotspot/cpu/riscv/c2_CodeStubs_riscv.cpp @@ -47,7 +47,7 @@ void C2SafepointPollStub::emit(C2_MacroAssembler& masm) { } void C2EntryBarrierStub::emit(C2_MacroAssembler& masm) { - IncompressibleRegion ir(&masm); // Fixed length + Assembler::IncompressibleRegion ir(&masm); // Fixed length // make guard value 4-byte aligned so that it can be accessed by atomic instructions on riscv int alignment_bytes = __ align(4); @@ -61,7 +61,7 @@ void C2EntryBarrierStub::emit(C2_MacroAssembler& masm) { __ bind(guard()); __ relocate(entry_guard_Relocation::spec()); - __ assert_alignment(pc()); + __ assert_alignment(__ pc()); __ emit_int32(0); // nmethod guard value // make sure the stub with a fixed code size if (alignment_bytes == 2) { PPC64LE seems to build fine with the latest changes. ------------- PR: https://git.openjdk.org/jdk/pull/11188 From shade at openjdk.org Tue Nov 22 17:09:05 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 22 Nov 2022 17:09:05 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 13:06:05 GMT, Aleksey Shipilev wrote: > You probably need to call it `c2_CodeStubs_riscv.cpp`? This might be some weird build system quirk? Ah. "Of course." You have `share/opto/c2_CodeStubs.cpp` and `cpu/riscv/c2_CodeStubs.cpp`. Only one gets compiled. Don't clash the compilation unit names :) ------------- PR: https://git.openjdk.org/jdk/pull/11188 From rkennke at openjdk.org Tue Nov 22 17:09:05 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 22 Nov 2022 17:09:05 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 13:42:26 GMT, Aleksey Shipilev wrote: > PPC64LE seems to build fine with the latest changes. Thanks for testing and providing the build fix! ------------- PR: https://git.openjdk.org/jdk/pull/11188 From dcubed at openjdk.org Tue Nov 22 17:13:12 2022 From: dcubed at openjdk.org (Daniel D. 
Daugherty) Date: Tue, 22 Nov 2022 17:13:12 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest Message-ID: Misc stress testing related fixes: [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode ------------- Commit messages: - 8297369: disable Fuzz.java in slowdebug mode - 8297367: disable TestRedirectLinks.java in slowdebug mode - 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest Changes: https://git.openjdk.org/jdk/pull/11278/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11278&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295424 Stats: 17 lines in 3 files changed: 16 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11278.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11278/head:pull/11278 PR: https://git.openjdk.org/jdk/pull/11278 From duke at openjdk.org Tue Nov 22 17:20:56 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Tue, 22 Nov 2022 17:20:56 GMT Subject: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v22] In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 20:42:27 GMT, Volodymyr Paprotski wrote: >> Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in `macroAssembler_x86_poly.cpp`. >> >> - Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java. >> - Would like to add an `InvalidKeyException` in `Poly1305.java` (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please. >> - Added a JMH perf test. >> - JMH test had to use reflection (instead of existing `MacBench.java`), since Poly1305 is not 'properly' registered with the provider. >> >> Perf before: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ? 110554.162 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ? 86696.037 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ? 14074.655 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ? 390.921 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ? 1.402 ops/s >> >> and after: >> >> Benchmark (dataSize) (provider) Mode Cnt Score Error Units >> Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ? 154528.057 ops/s >> Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ? 95253.445 ops/s >> Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ? 100847.766 ops/s >> Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ? 25883.825 ops/s >> Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ? 56.147 ops/s > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > remove early return @robcasloz Update to [JDK-8297417](https://bugs.openjdk.org/browse/JDK-8297417) (since I don't have an account on the bugtracker yet to update there) Not able to reproduce it on Linux yet. The seed should make it deterministic.. but nothing. Resurrecting's my windows sandbox to see if I can reproduce on windows (only difference on windows is the intrinsic function register linkage. 
However problem there would make the problem _very_ deterministic.. I think) ------------- PR: https://git.openjdk.org/jdk/pull/10582 From rrich at openjdk.org Tue Nov 22 17:29:03 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 22 Nov 2022 17:29:03 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v8] In-Reply-To: References: Message-ID: > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. > > The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. > > > X86 / AARCH64 PPC64: > > : : : : > : : : : > | | | | > |-----------------| |-----------------| > | | | | > | stack arguments | | stack arguments | > | |<- callers_SP | | > =================== |-----------------| > | | | | > | metadata at bottom | | metadata at top | > | | | |<- callers_SP > |-----------------| =================== > | | | | > | | | | > | | | | > | | | | > | |<- SP | | > =================== |-----------------| > | | > | metadata at top | > | |<- SP > =================== > > > On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. > > * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: > `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` > > * address of stack arguments: > `callers_SP + frame::metadata_words_at_top` > > * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. > > Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. > > The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. 
> > Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. > > Thanks, Richard. Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Feedback Leonid ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10961/files - new: https://git.openjdk.org/jdk/pull/10961/files/22430750..5cf90744 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=06-07 Stats: 272 lines in 1 file changed: 172 ins; 0 del; 100 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Tue Nov 22 17:29:07 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 22 Nov 2022 17:29:07 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 01:51:19 GMT, Leonid Mesnik wrote: >> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: >> >> Cleanup BasicExp test > > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 129: > >> 127: * @summary Collection of basic continuation tests. CompilationPolicy controls which frames in a sequence should be compiled when calling Continuation.yield(). >> 128: * @requires vm.continuations >> 129: * @requires vm.flavor == "server" & (vm.opt.TieredCompilation == null | vm.opt.TieredCompilation == false) > > Isn't vm.opt.TieredCompilation != true better? Ah, sure. > It is better to use tag @enablePreview so --enable-preview is not needed. Done > Also, please remove -XX:+UnlockDiagnosticVMOptions where it is not required. The release build requires -XX:+UnlockDiagnosticVMOptions for -XX:+WhiteBoxAPI. Without I get Error: VM option 'WhiteBoxAPI' is diagnostic and must be enabled via -XX:+UnlockDiagnosticVMOptions. Error: The unlock option must precede 'WhiteBoxAPI'. Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 224: > >> 222: compPolicy.print(); System.out.println(); >> 223: >> 224: new ContinuationRunYieldRunTest().runTestCase(3, compPolicy); > > I think it would be better to split the test into several smaller testcases so timeout=300 is not needed. The fine granularity allows better parallelization of execution. The 5 minutes timeout means a very long-time test that couldn't be split. Done ------------- PR: https://git.openjdk.org/jdk/pull/10961 From duke at openjdk.org Tue Nov 22 17:32:51 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Tue, 22 Nov 2022 17:32:51 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes [v3] In-Reply-To: References: Message-ID: > I noticed some idealizations have no associated IR tests so I included for them. Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge master. 
- include @bug and @summary for newly created ir test classes. - format. - format whitespace. - add some missing tests for existing idealizations. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11049/files - new: https://git.openjdk.org/jdk/pull/11049/files/c0931397..bb92a752 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11049&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11049&range=01-02 Stats: 53640 lines in 1037 files changed: 19606 ins; 20747 del; 13287 mod Patch: https://git.openjdk.org/jdk/pull/11049.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11049/head:pull/11049 PR: https://git.openjdk.org/jdk/pull/11049 From rrich at openjdk.org Tue Nov 22 17:33:05 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 22 Nov 2022 17:33:05 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 01:57:54 GMT, Leonid Mesnik wrote: >> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: >> >> Cleanup BasicExp test > > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 369: > >> 367: } >> 368: >> 369: public void log_dontjit() { > > Please use camelCase for all functions. I'd rather not because where I do not already use camelCase it is because I added a name prefix/suffix to the method that is used to control compilation and inlining. IMHO readability is better if these technical name extensions are separated by an underscore from the method name. Let's see what others say. > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 484: > >> 482: log_dontjit("Exc: " + e); >> 483: } >> 484: if (callSystemGC) System.gc(); > > It would be better to call WB.fullGC() not System,gc() to ensure that GC is called. Good point! Actually WB.youngGC() is sufficient and reduces the runtime significantly. > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 504: > >> 502: } >> 503: >> 504: static final long i1=1; static final long i2=2; static final long i3=3; > > Please fix identation. Done > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 729: > >> 727: log_dontjit("Continuation running on thread " + Thread.currentThread()); >> 728: long res = ord101_recurse_dontinline(0, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11); >> 729: if (res != i1+i2+i3+i4+i5+i6+i7+i8+i9+i10+i11) { > > Please fix the indentation. There are several places were 'i1+i2' should be fixed to 'i1 + i2 '. Done ------------- PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Tue Nov 22 17:37:58 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 22 Nov 2022 17:37:58 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 02:29:41 GMT, Leonid Mesnik wrote: >> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: >> >> Cleanup BasicExp test > > test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 193: > >> 191: compLevel = CompilerWhiteBoxTest.COMP_LEVEL_FULL_OPTIMIZATION; >> 192: // // Run tests with C1 compilations >> 193: // compLevel = CompilerWhiteBoxTest.COMP_LEVEL_FULL_PROFILE; > > Any reasons why the test is executed with C2 only? And is -XX:-TieredCompilation required if optimization level and compilation are controlled with WB? It doesn't matter that much which compiler is used. What really matters is to have compiled frames at specific locations when freezing/thawing. 
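Relatedly, the size and address formulas from the PPC64 RFR quoted earlier in this thread are easy to misread in prose, so a short hedged sketch of what they amount to in code may help; the helper names are made up and this is not the shared StackChunk code itself:

// Sketch of the two formulas from the RFR description (hypothetical helpers).
// frame::metadata_words_at_top is 0 words on x86/aarch64 and 4 words on ppc64.
static inline int freeze_size_in_words(int frame_size_in_words,
                                       int stack_arg_size_in_words) {
  // room needed to freeze one frame plus its stack arguments into a StackChunk
  return frame_size_in_words + stack_arg_size_in_words + frame::metadata_words_at_top;
}

static inline intptr_t* stack_args_address(intptr_t* callers_sp) {
  // the frame's stack arguments start just above the caller's metadata, if any
  return callers_sp + frame::metadata_words_at_top;
}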
The test cases control which methods are compiled and which are not. I don't want TieredCompilation to interfere with that. ------------- PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Tue Nov 22 17:41:41 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 22 Nov 2022 17:41:41 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 02:47:30 GMT, Leonid Mesnik wrote: >> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: >> >> Cleanup BasicExp test > > Not a complete review but some notes about test style. Thanks for your feedback @lmesnik. It help reducing the runtime of the test significantly. ------------- PR: https://git.openjdk.org/jdk/pull/10961 From lmesnik at openjdk.org Tue Nov 22 18:47:49 2022 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Tue, 22 Nov 2022 18:47:49 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 17:24:09 GMT, Richard Reingruber wrote: >> test/jdk/jdk/internal/vm/Continuation/BasicExp.java line 136: >> >>> 134: * @run driver jdk.test.lib.helpers.ClassFileInstaller jdk.test.whitebox.WhiteBox >>> 135: * >>> 136: * @run main/othervm/timeout=300 --enable-preview -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xbootclasspath/a:. >> >> It is better to use tag @enablePreview so --enable-preview is not needed. >> Also, please remove -XX:+UnlockDiagnosticVMOptions where it is not required. > >> It is better to use tag @enablePreview so --enable-preview is not needed. > > Done > >> Also, please remove -XX:+UnlockDiagnosticVMOptions where it is not required. > > The release build requires -XX:+UnlockDiagnosticVMOptions for -XX:+WhiteBoxAPI. > > Without I get > > > Error: VM option 'WhiteBoxAPI' is diagnostic and must be enabled via -XX:+UnlockDiagnosticVMOptions. > Error: The unlock option must precede 'WhiteBoxAPI'. > Error: Could not create the Java Virtual Machine. > Error: A fatal exception has occurred. Program will exit. Sorry, I meant -XX:+IgnoreUnrecognizedVMOptions where it is not needed. I think it is needed only to don't fail when -XX:+VerifyContinuations is used in product mode and not needed in other testcases ------------- PR: https://git.openjdk.org/jdk/pull/10961 From sspitsyn at openjdk.org Tue Nov 22 19:47:22 2022 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Tue, 22 Nov 2022 19:47:22 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. Daugherty wrote: > Misc stress testing related fixes: > > [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest > [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode > [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode This looks good. Thanks, Serguei test/langtools/jdk/javadoc/doclet/testLinkOption/TestRedirectLinks.java line 73: > 71: > 72: import jdk.test.lib.Platform; > 73: import jtreg.SkippedException; Nit: the order of imports on 72-73 needs to be swapped. ------------- Marked as reviewed by sspitsyn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11278 From duke at openjdk.org Tue Nov 22 21:08:31 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Tue, 22 Nov 2022 21:08:31 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes [v2] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 09:47:06 GMT, Christian Hagedorn wrote: > Nice additional tests! They look good. Thank you for reviewing. I noticed `RotateLeftNodeIntIdealizationTests.java` and `RotateLeftNodeLongIdealizationTests.java` failed on x86. Does this matter should we make sure they pass? ------------- PR: https://git.openjdk.org/jdk/pull/11049 From dcubed at openjdk.org Tue Nov 22 21:09:26 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 22 Nov 2022 21:09:26 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> References: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> Message-ID: On Tue, 22 Nov 2022 19:43:38 GMT, Serguei Spitsyn wrote: >> Misc stress testing related fixes: >> >> [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest >> [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode >> [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode > > This looks good. > Thanks, > Serguei @sspitsyn - Thanks for the review! > test/langtools/jdk/javadoc/doclet/testLinkOption/TestRedirectLinks.java line 73: > >> 71: >> 72: import jdk.test.lib.Platform; >> 73: import jtreg.SkippedException; > > Nit: the order of imports on 72-73 needs to be swapped. Why? 'jdk' comes before 'jtreg' and 'Platform' comes before 'SkippedException'. What am I missing here? ------------- PR: https://git.openjdk.org/jdk/pull/11278 From rrich at openjdk.org Tue Nov 22 20:40:52 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 22 Nov 2022 20:40:52 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v9] In-Reply-To: References: Message-ID: > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. > > The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. > > > X86 / AARCH64 PPC64: > > : : : : > : : : : > | | | | > |-----------------| |-----------------| > | | | | > | stack arguments | | stack arguments | > | |<- callers_SP | | > =================== |-----------------| > | | | | > | metadata at bottom | | metadata at top | > | | | |<- callers_SP > |-----------------| =================== > | | | | > | | | | > | | | | > | | | | > | |<- SP | | > =================== |-----------------| > | | > | metadata at top | > | |<- SP > =================== > > > On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) 
is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. > > * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: > `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` > > * address of stack arguments: > `callers_SP + frame::metadata_words_at_top` > > * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. > > Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. > > The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. > > Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. > > Thanks, Richard. Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: More Feedback Leonid ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10961/files - new: https://git.openjdk.org/jdk/pull/10961/files/5cf90744..5f59f2cf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=07-08 Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From rrich at openjdk.org Tue Nov 22 20:40:53 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 22 Nov 2022 20:40:53 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v7] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 18:43:45 GMT, Leonid Mesnik wrote: >>> It is better to use tag @enablePreview so --enable-preview is not needed. >> >> Done >> >>> Also, please remove -XX:+UnlockDiagnosticVMOptions where it is not required. >> >> The release build requires -XX:+UnlockDiagnosticVMOptions for -XX:+WhiteBoxAPI. >> >> Without I get >> >> >> Error: VM option 'WhiteBoxAPI' is diagnostic and must be enabled via -XX:+UnlockDiagnosticVMOptions. >> Error: The unlock option must precede 'WhiteBoxAPI'. 
>> Error: Could not create the Java Virtual Machine. >> Error: A fatal exception has occurred. Program will exit. > > Sorry, I meant -XX:+IgnoreUnrecognizedVMOptions where it is not needed. I think it is needed only to don't fail when -XX:+VerifyContinuations is used in product mode and not needed in other testcases Ok (I thought so). I've fixed it now. ------------- PR: https://git.openjdk.org/jdk/pull/10961 From jjg at openjdk.org Tue Nov 22 23:20:02 2022 From: jjg at openjdk.org (Jonathan Gibbons) Date: Tue, 22 Nov 2022 23:20:02 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. Daugherty wrote: > Misc stress testing related fixes: > > [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest > [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode > [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode I accept the javadoc change but dislike the general methodology: it's too much like brushing the dirt under the carpet. In general, I think it is better to use keywords, or `@require` to mark such tests and then (if using keywords) use command-line options to filter out such tests. ------------- Marked as reviewed by jjg (Reviewer). PR: https://git.openjdk.org/jdk/pull/11278 From jjg at openjdk.org Tue Nov 22 23:20:04 2022 From: jjg at openjdk.org (Jonathan Gibbons) Date: Tue, 22 Nov 2022 23:20:04 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> Message-ID: On Tue, 22 Nov 2022 21:05:16 GMT, Daniel D. Daugherty wrote: >> test/langtools/jdk/javadoc/doclet/testLinkOption/TestRedirectLinks.java line 73: >> >>> 71: >>> 72: import jdk.test.lib.Platform; >>> 73: import jtreg.SkippedException; >> >> Nit: the order of imports on 72-73 needs to be swapped. > > Why? 'jdk' comes before 'jtreg' and 'Platform' comes before 'SkippedException'. > What am I missing here? Mild grumble: langtools tests do not rely on jdk test libraries ------------- PR: https://git.openjdk.org/jdk/pull/11278 From dcubed at openjdk.org Tue Nov 22 23:33:20 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 22 Nov 2022 23:33:20 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 23:17:24 GMT, Jonathan Gibbons wrote: >> Misc stress testing related fixes: >> >> [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest >> [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode >> [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode > > I accept the javadoc change but dislike the general methodology: it's too much like brushing the dirt under the carpet. > > In general, I think it is better to use keywords, or `@require` to mark such tests and then (if using keywords) use command-line options to filter out such tests. @jonathan-gibbons - Thanks for the review! I could not find an @requires incantation for saying do-not-use-slowdebug-bits nor one for saying do-not-use-macosx-aarch64. 
I don't really do a lot with @requires so I could be missing something. > it's too much like brushing the dirt under the carpet. Please see the parent bugs for [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) and [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) and you'll see that I have clearly documented the failures that I've been seeing. I do plan to leave those bugs open, but I've gotten tired of accounting for those failures in my weekly stress testing runs. ------------- PR: https://git.openjdk.org/jdk/pull/11278 From dcubed at openjdk.org Tue Nov 22 23:33:20 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 22 Nov 2022 23:33:20 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> Message-ID: On Tue, 22 Nov 2022 23:15:48 GMT, Jonathan Gibbons wrote: >> Why? 'jdk' comes before 'jtreg' and 'Platform' comes before 'SkippedException'. >> What am I missing here? > > Mild grumble: langtools tests do not rely on jdk test libraries Does langtools have its own test libraries that I can use to ask the same questions? ------------- PR: https://git.openjdk.org/jdk/pull/11278 From sspitsyn at openjdk.org Wed Nov 23 00:08:29 2022 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Wed, 23 Nov 2022 00:08:29 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> Message-ID: On Tue, 22 Nov 2022 23:31:07 GMT, Daniel D. Daugherty wrote: >> Mild grumble: langtools tests do not rely on jdk test libraries > > Does langtools have its own test libraries that I can use to ask the same questions? Sorry, I was not clear. The Fuzz.java has this order: +import jdk.test.lib.Platform; +import jtreg.SkippedException; I thought, you ordered imports by names. Then it is better to keep this order unified. It is really minor though. ------------- PR: https://git.openjdk.org/jdk/pull/11278 From lmesnik at openjdk.org Wed Nov 23 01:04:36 2022 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Wed, 23 Nov 2022 01:04:36 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v9] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 20:40:52 GMT, Richard Reingruber wrote: >> Hi, >> >> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. >> More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). >> >> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. >> >> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. 
>> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > More Feedback Leonid Thanks for fixing the test. I haven't done the full review, not adding myself as a reviewer. Don't have any additional comments. ------------- PR: https://git.openjdk.org/jdk/pull/10961 From dholmes at openjdk.org Wed Nov 23 02:19:56 2022 From: dholmes at openjdk.org (David Holmes) Date: Wed, 23 Nov 2022 02:19:56 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 23:30:57 GMT, Daniel D. 
Daugherty wrote: > I could not find an @requires incantation for saying do-not-use-slowdebug-bits nor one for saying do-not-use-macosx-aarch64. Something like: `@requires vm.debug != slowdebug` `@requires !(os.arch == "aarch64" && os.family == "mac")` ------------- PR: https://git.openjdk.org/jdk/pull/11278 From cjplummer at openjdk.org Wed Nov 23 02:30:31 2022 From: cjplummer at openjdk.org (Chris Plummer) Date: Wed, 23 Nov 2022 02:30:31 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. Daugherty wrote: > Misc stress testing related fixes: > > [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest > [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode > [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode Do you plan on closing the CRs associated with these changes even though the root causes are not being addressed, just avoided? It's not clear what is meant by "Test is unstable". Is the test buggy, or are these JVM issues? In either case shouldn't we be trying to understand why it is unstable with slowdebug bug not fastdebug? ------------- PR: https://git.openjdk.org/jdk/pull/11278 From duke at openjdk.org Wed Nov 23 03:08:39 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 23 Nov 2022 03:08:39 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception Message-ID: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> >From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. 
Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): __ mov(t0, a0); __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) ------------- Commit messages: - fix t0 rscratch usage overlap Changes: https://git.openjdk.org/jdk/pull/11308/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11308&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297417 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11308.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11308/head:pull/11308 PR: https://git.openjdk.org/jdk/pull/11308 From svkamath at openjdk.org Wed Nov 23 04:30:14 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 23 Nov 2022 04:30:14 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" Message-ID: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" ------------- Commit messages: - Updated code to fix windows build issue - Removed test from ProblemList-Xcomp file - Fix for JDK-829531 Changes: https://git.openjdk.org/jdk/pull/11301/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11301&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295351 Stats: 19 lines in 2 files changed: 12 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/11301.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11301/head:pull/11301 PR: https://git.openjdk.org/jdk/pull/11301 From svkamath at openjdk.org Wed Nov 23 04:30:15 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 23 Nov 2022 04:30:15 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 21:52:59 GMT, Smita Kamath wrote: > 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" Hi All, I have updated f2hf and hf2f methods in sharedRuntime.cpp as a fix for the error unexpected result of converting. Kindly review this patch and provide feedback. Thank you. 
Regards, Smita ------------- PR: https://git.openjdk.org/jdk/pull/11301 From duke at openjdk.org Wed Nov 23 04:30:16 2022 From: duke at openjdk.org (ExE Boss) Date: Wed, 23 Nov 2022 04:30:16 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 21:52:59 GMT, Smita Kamath wrote: > 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" src/hotspot/share/runtime/sharedRuntime.cpp line 531: > 529: return bits.f; > 530: } > 531: } Wrong?indentation: Suggestion: } else if (hf_exp == 16) { if (hf_significand_bits == 0) { bits.i = 0x7f800000; return sign * bits.f; } else { bits.i = (hf_sign_bit << 16) | 0x7f800000 | (hf_significand_bits << significand_shift); return bits.f; } } ------------- PR: https://git.openjdk.org/jdk/pull/11301 From dholmes at openjdk.org Wed Nov 23 05:09:23 2022 From: dholmes at openjdk.org (David Holmes) Date: Wed, 23 Nov 2022 05:09:23 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 21:52:59 GMT, Smita Kamath wrote: > 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" src/hotspot/share/runtime/sharedRuntime.cpp line 455: > 453: union {jfloat f; juint i;} bits; > 454: bits.f = x; > 455: jint doppel = bits.i; Doesn't the conversion from unsigned to signed risk a compiler warning being emitted? Can't you just use the existing `JavaValue` type to perform the union conversion trick? ------------- PR: https://git.openjdk.org/jdk/pull/11301 From fyang at openjdk.org Wed Nov 23 06:26:59 2022 From: fyang at openjdk.org (Fei Yang) Date: Wed, 23 Nov 2022 06:26:59 GMT Subject: RFR: 8297476: Increase InlineSmallCode default from 1000 to 2000 for RISC-V Message-ID: The current default value of InlineSmallCode on RISC-V is 1000. I witnessed notable performance improvement by increasing this value to 2000 when running the renaissance benchmark. Here are the exact commands used for each of the benchmarks: Before: $ java -XX:InlineSmallCode=1000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all After: $ java -XX:InlineSmallCode=2000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all Best run time for one repetition (ms ? lower is better) on Unmatched board: Benchmark | Before | After | Ratio -- | -- | -- | -- AkkaUct | 75629.766 | 71839.905 | 5.01% Reactors | 98120.668 | 91597.120 | 6.65% DecTree | 12144.666 | 11801.740 | 2.82% Als | 57719.166 | 53307.041 | 7.64% ChiSquare | 21704.666 | 16301.189 | 24.89% GaussMix | 17494.891 | 17497.291 | -0.02% LogRegression | 11881.352 | 11382.722 | 4.20% MovieLens | 100944.374 | 96510.793 | 4.39% NaiveBayes | 81946.569 | 68566.763 | 16.32% PageRank | 43689.497 | 43204.553 | 1.11% FjKmeans | 68398.667 | 67261.674 | 1.66% FutureGenetic | 31752.695 | 31524.457 | 0.72% Mnemonics | 126312.832 | 115335.512 | 8.69% ParMnemonics | 93406.666 | 88320.443 | 5.45% Scrabble | 6894.853 | 6888.426 | 0.09% RxScrabble | 5163.473 | 4875.730 | 5.08% Dotty | 14852.405 | 14667.255 | 1.25% ScalaDoku | 95770.117 | 39728.637 | 58.52% Philosophers | 13974.965 | 11579.551 | 17.14% ScalaStmBench7 | 12185.093 | 12243.016 | -0.47% FinagleChirper | 32676.065 | 30900.282 | 5.44% FinagleHttp | 30633.640 | 30191.792 | 1.44% Other testing: tier1-tier3 tested on Unmatched board. 
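Coming back to David Holmes' question on the sharedRuntime.cpp f2hf/hf2f change a few messages up (PR 11301): if the concern is union type punning plus an unsigned-to-signed conversion warning, the conventional alternative is a memcpy-based bit copy between the float and an integer of the wanted signedness. A stand-alone sketch of that idea only; it is not the proposed patch, and HotSpot has its own helpers and conventions for this:

#include <cstdint>
#include <cstring>

// Move the bit pattern between float and a 32-bit integer without union
// type punning; memcpy between objects of equal size is well defined and
// compilers reduce it to a plain register move.
static inline uint32_t float_to_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return bits;
}

static inline float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    const uint32_t inf_bits = 0x7f800000u;   // IEEE 754 binary32 +infinity
    return float_to_bits(bits_to_float(inf_bits)) == inf_bits ? 0 : 1;
}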
I have also tested other possible values like 1500 and 2500, but the numbers say show 2000 would outperform those values in most of the cases. And I can verify no regressions across at least following benchmarks: - Dacapo - SPECjvm2008 - SPECjbb2005 - SPECjbb2015 ------------- Commit messages: - 8297476: Increase InlineSmallCode default from 1000 to 2000 for RISC-V Changes: https://git.openjdk.org/jdk/pull/11310/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11310&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297476 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11310.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11310/head:pull/11310 PR: https://git.openjdk.org/jdk/pull/11310 From shade at openjdk.org Wed Nov 23 07:17:00 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 23 Nov 2022 07:17:00 GMT Subject: RFR: 8297476: Increase InlineSmallCode default from 1000 to 2000 for RISC-V In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 06:18:44 GMT, Fei Yang wrote: > The current default value of InlineSmallCode on RISC-V is 1000. I witnessed notable performance improvement by increasing this value to 2000 when running the Renaissance benchmark. Here are the exact commands used for each of the benchmarks: > > Before: > $ java -XX:InlineSmallCode=1000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all > > After: > $ java -XX:InlineSmallCode=2000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all > > Best run time for one repetition (ms ? lower is better) on Unmatched board: > > Benchmark | Before | After | Ratio > -- | -- | -- | -- > AkkaUct | 75629.766 | 71839.905 | 5.01% > Reactors | 98120.668 | 91597.120 | 6.65% > DecTree | 12144.666 | 11801.740 | 2.82% > Als | 57719.166 | 53307.041 | 7.64% > ChiSquare | 21704.666 | 16301.189 | 24.89% > GaussMix | 17494.891 | 17497.291 | -0.02% > LogRegression | 11881.352 | 11382.722 | 4.20% > MovieLens | 100944.374 | 96510.793 | 4.39% > NaiveBayes | 81946.569 | 68566.763 | 16.32% > PageRank | 43689.497 | 43204.553 | 1.11% > FjKmeans | 68398.667 | 67261.674 | 1.66% > FutureGenetic | 31752.695 | 31524.457 | 0.72% > Mnemonics | 126312.832 | 115335.512 | 8.69% > ParMnemonics | 93406.666 | 88320.443 | 5.45% > Scrabble | 6894.853 | 6888.426 | 0.09% > RxScrabble | 5163.473 | 4875.730 | 5.08% > Dotty | 14852.405 | 14667.255 | 1.25% > ScalaDoku | 95770.117 | 39728.637 | 58.52% > Philosophers | 13974.965 | 11579.551 | 17.14% > ScalaStmBench7 | 12185.093 | 12243.016 | -0.47% > FinagleChirper | 32676.065 | 30900.282 | 5.44% > FinagleHttp | 30633.640 | 30191.792 | 1.44% > > Other testing: tier1-tier3 tested on Unmatched board. > > I have also tested other possible values for InlineSmallCode like 1500 and 2500, but the numbers say show 2000 would outperform those values in most of the cases. And I can verify no regressions across at least following benchmarks: > > - Dacapo > - SPECjvm2008 > - SPECjbb2005 > - SPECjbb2015 I am curious: is `2000` significantly better than `2500` that other platforms overrides do here? If not, I'd prefer `2500`, so that we could eventually just use it as platform-independent default value. 
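For readers unfamiliar with how such a per-platform default is expressed: HotSpot's per-CPU defaults for flags like InlineSmallCode normally live in the platform C2 globals header as define_pd_global entries, and they remain overridable with -XX:InlineSmallCode=... exactly as in the benchmark commands above. A hedged sketch of what the RISC-V override could look like, using the 2500 value Aleksey suggests above rather than the originally proposed 2000; the actual patch may place or phrase this differently:

// Sketch only; the existing RISC-V default lives in a header along the lines
// of src/hotspot/cpu/riscv/c2_globals_riscv.hpp.
// Raises the nmethod-size threshold below which an already compiled callee
// is still considered for inlining on RISC-V.
define_pd_global(intx, InlineSmallCode, 2500);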
------------- PR: https://git.openjdk.org/jdk/pull/11310 From chagedorn at openjdk.org Wed Nov 23 09:06:24 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 23 Nov 2022 09:06:24 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes [v3] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 17:32:51 GMT, Zhiqiang Zang wrote: >> I noticed some idealizations have no associated IR tests so I included for them. > > Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge master. > - include @bug and @summary for newly created ir test classes. > - format. > - format whitespace. > - add some missing tests for existing idealizations. We should make sure to make them pass on x86 as well such that GHA is clean after the integration of this PR. I've had a quick look on the match rules and it seems that we only use `RotateLeft/Right` nodes on x86_64, aarch64 and riscv. We could restrict these tests to only run on these architectures with `@requires os.arch == "x86_64" | os.arch == "aarch64" | os.arch == "riscv64"`. ------------- PR: https://git.openjdk.org/jdk/pull/11049 From fyang at openjdk.org Wed Nov 23 10:01:43 2022 From: fyang at openjdk.org (Fei Yang) Date: Wed, 23 Nov 2022 10:01:43 GMT Subject: RFR: 8297476: Increase InlineSmallCode default from 1000 to 2000 for RISC-V [v2] In-Reply-To: References: Message-ID: > The current default value of InlineSmallCode on RISC-V is 1000. I witnessed notable performance improvement by increasing this value to 2000 when running the Renaissance benchmark. Here are the exact commands used for each of the benchmarks: > > Before: > $ java -XX:InlineSmallCode=1000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all > > After: > $ java -XX:InlineSmallCode=2000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all > > Best run time for one repetition (ms ? lower is better) on Unmatched board: > > Benchmark | Before | After | Ratio > -- | -- | -- | -- > AkkaUct | 75629.766 | 71839.905 | 5.01% > Reactors | 98120.668 | 91597.120 | 6.65% > DecTree | 12144.666 | 11801.740 | 2.82% > Als | 57719.166 | 53307.041 | 7.64% > ChiSquare | 21704.666 | 16301.189 | 24.89% > GaussMix | 17494.891 | 17497.291 | -0.02% > LogRegression | 11881.352 | 11382.722 | 4.20% > MovieLens | 100944.374 | 96510.793 | 4.39% > NaiveBayes | 81946.569 | 68566.763 | 16.32% > PageRank | 43689.497 | 43204.553 | 1.11% > FjKmeans | 68398.667 | 67261.674 | 1.66% > FutureGenetic | 31752.695 | 31524.457 | 0.72% > Mnemonics | 126312.832 | 115335.512 | 8.69% > ParMnemonics | 93406.666 | 88320.443 | 5.45% > Scrabble | 6894.853 | 6888.426 | 0.09% > RxScrabble | 5163.473 | 4875.730 | 5.08% > Dotty | 14852.405 | 14667.255 | 1.25% > ScalaDoku | 95770.117 | 39728.637 | 58.52% > Philosophers | 13974.965 | 11579.551 | 17.14% > ScalaStmBench7 | 12185.093 | 12243.016 | -0.47% > FinagleChirper | 32676.065 | 30900.282 | 5.44% > FinagleHttp | 30633.640 | 30191.792 | 1.44% > > Other testing: tier1-tier3 tested on Unmatched board. > > I have also tested other possible values for InlineSmallCode like 1500 and 2500, but the numbers say show 2000 would outperform those values in most of the cases. 
And I can verify no regressions across at least following benchmarks: > > - Dacapo > - SPECjvm2008 > - SPECjbb2005 > - SPECjbb2015 Fei Yang has updated the pull request incrementally with one additional commit since the last revision: Review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11310/files - new: https://git.openjdk.org/jdk/pull/11310/files/b3648f37..00bf187b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11310&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11310&range=00-01 Stats: 13 lines in 1 file changed: 0 ins; 12 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11310.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11310/head:pull/11310 PR: https://git.openjdk.org/jdk/pull/11310 From shade at openjdk.org Wed Nov 23 10:04:57 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 23 Nov 2022 10:04:57 GMT Subject: RFR: 8297476: Increase InlineSmallCode default from 1000 to 2500 for RISC-V [v2] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 10:01:43 GMT, Fei Yang wrote: >> The current default value of InlineSmallCode on RISC-V is 1000. I witnessed notable performance improvement by increasing this value to 2000 when running the Renaissance benchmark. Here are the exact commands used for each of the benchmarks: >> >> Before: >> $ java -XX:InlineSmallCode=1000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all >> >> After: >> $ java -XX:InlineSmallCode=2000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all >> >> Best run time for one repetition (ms ? lower is better) on Unmatched board: >> >> Benchmark | Before | After | Ratio >> -- | -- | -- | -- >> AkkaUct | 75629.766 | 71839.905 | 5.01% >> Reactors | 98120.668 | 91597.120 | 6.65% >> DecTree | 12144.666 | 11801.740 | 2.82% >> Als | 57719.166 | 53307.041 | 7.64% >> ChiSquare | 21704.666 | 16301.189 | 24.89% >> GaussMix | 17494.891 | 17497.291 | -0.02% >> LogRegression | 11881.352 | 11382.722 | 4.20% >> MovieLens | 100944.374 | 96510.793 | 4.39% >> NaiveBayes | 81946.569 | 68566.763 | 16.32% >> PageRank | 43689.497 | 43204.553 | 1.11% >> FjKmeans | 68398.667 | 67261.674 | 1.66% >> FutureGenetic | 31752.695 | 31524.457 | 0.72% >> Mnemonics | 126312.832 | 115335.512 | 8.69% >> ParMnemonics | 93406.666 | 88320.443 | 5.45% >> Scrabble | 6894.853 | 6888.426 | 0.09% >> RxScrabble | 5163.473 | 4875.730 | 5.08% >> Dotty | 14852.405 | 14667.255 | 1.25% >> ScalaDoku | 95770.117 | 39728.637 | 58.52% >> Philosophers | 13974.965 | 11579.551 | 17.14% >> ScalaStmBench7 | 12185.093 | 12243.016 | -0.47% >> FinagleChirper | 32676.065 | 30900.282 | 5.44% >> FinagleHttp | 30633.640 | 30191.792 | 1.44% >> >> Other testing: tier1-tier3 tested on Unmatched board. >> >> I have also tested other possible values for InlineSmallCode like 1500 and 2500, but the numbers say show 2000 would outperform those values in most of the cases. And I can verify no regressions across at least following benchmarks: >> >> - Dacapo >> - SPECjvm2008 >> - SPECjbb2005 >> - SPECjbb2015 > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Review I like this, thanks. ------------- Marked as reviewed by shade (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11310 From fyang at openjdk.org Wed Nov 23 10:04:59 2022 From: fyang at openjdk.org (Fei Yang) Date: Wed, 23 Nov 2022 10:04:59 GMT Subject: RFR: 8297476: Increase InlineSmallCode default from 1000 to 2500 for RISC-V [v2] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 07:13:37 GMT, Aleksey Shipilev wrote: > I am curious: is `2000` significantly better than `2500` that other platforms overrides do here? If not, I'd prefer `2500`, so that we could eventually just use it as platform-independent default value. Well, not that significant. I am OK to default to '2500' too. I have pushed a new version unifying the code for the three CPUs. Please take another look. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/11310 From dnsimon at openjdk.org Wed Nov 23 10:54:31 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 23 Nov 2022 10:54:31 GMT Subject: RFR: 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception [v2] In-Reply-To: References: Message-ID: > JVMCI has a mechanism for translating exceptions from libjvmci to HotSpot and vice versa. This is important for proper error handling when a thread calls between these 2 runtime heaps. > > This translation mechanism itself needs to be robust in the context of resource limits, especially heap limits, as it may be translating an OutOfMemoryError from HotSpot back into libjvmci. The existing code in [`HotSpotJVMCIRuntime.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java#L237) and [`TranslatedException.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/TranslatedException.java#L153) is designed to handle translation failures by falling back to non-allocating code. However, we still occasionally see [an OOME that breaks the translation mechanism](https://github.com/oracle/graal/issues/5470#issuecomment-1321749688). One speculated possibility for this is an OOME re-materializing oops du ring a deoptimization causing an unexpected execution path. This PR increases the robustness of the exception translation code in light of such issues. 
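As an aside for readers who have not seen this code: the names below are invented and this is only a toy model of the "fall back to a non-allocating path" idea described above, not the JVMCI implementation. The point is simply that the error-reporting path must itself survive an allocation failure:

    #include <cstddef>
    #include <cstring>
    #include <new>

    // Pre-allocated message used when encoding the real failure would itself allocate.
    static const char kOomFallback[] = "exception translation failed: OutOfMemoryError";

    // Toy encode_failure(): try to produce a detailed encoding; on allocation
    // failure degrade to the fixed, pre-allocated message instead of throwing again.
    const char* encode_failure(const char* detail) {
      std::size_t len = std::strlen(detail) + 1;
      char* buffer = new (std::nothrow) char[len];   // may fail under heap pressure
      if (buffer == nullptr) {
        return kOomFallback;                         // non-allocating fallback path
      }
      std::memcpy(buffer, detail, len);
      return buffer;
    }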
Doug Simon has updated the pull request incrementally with one additional commit since the last revision: add more context when possible to exception translation errors ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11286/files - new: https://git.openjdk.org/jdk/pull/11286/files/2d5f0817..591c7589 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11286&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11286&range=00-01 Stats: 34 lines in 2 files changed: 20 ins; 2 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/11286.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11286/head:pull/11286 PR: https://git.openjdk.org/jdk/pull/11286 From dnsimon at openjdk.org Wed Nov 23 10:58:05 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 23 Nov 2022 10:58:05 GMT Subject: RFR: 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception [v2] In-Reply-To: <2NRXrCm_YYPDtujf239wUNPW1iKv9ZyVDVkXmlLK0dA=.64e06762-03e7-43a5-b57f-77a24a99d042@github.com> References: <2NRXrCm_YYPDtujf239wUNPW1iKv9ZyVDVkXmlLK0dA=.64e06762-03e7-43a5-b57f-77a24a99d042@github.com> Message-ID: On Tue, 22 Nov 2022 16:29:58 GMT, Tom Rodriguez wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> add more context when possible to exception translation errors > > src/hotspot/share/jvmci/jvmciEnv.cpp line 321: > >> 319: jlong buffer = (jlong) NEW_RESOURCE_ARRAY_IN_THREAD_RETURN_NULL(THREAD, jbyte, buffer_size); >> 320: if (buffer == 0L) { >> 321: decode(THREAD, runtimeKlass, 0L); > > Can we add an argument so that each of these call sites reports a unique message? Should we get the class name from the pending exception and include that as well? I think we should include enough breadcrumbs for future failures in the this path that we might have a better guess what's happening. I think the case we care about most is an `OutOfMemoryError` occurring in the HotSpot heap so I've pushed a change that calls this out. I've also distinguished the case where the native buffer for the encoding cannot be allocated (although if that happens, the VM is in real trouble). ------------- PR: https://git.openjdk.org/jdk/pull/11286 From rrich at openjdk.org Wed Nov 23 14:00:04 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 23 Nov 2022 14:00:04 GMT Subject: RFR: 8297487: G1 Remark: no need to keep alive oop constants of nmethods on stack Message-ID: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> This pr removes the stackwalks to keep alive oops of nmethods found on stack during G1 remark as it seems redundant. The oops are already kept alive by the [nmethod entry barrier](https://github.com/openjdk/jdk/blob/f26bd4e0e8b68de297a9ff93526cd7fac8668320/src/hotspot/share/gc/shared/barrierSetNMethod.cpp#L85) Additionally it fixes a comment that says nmethod entry barriers are needed to deal with continuations which, afaik, is not the case. Please correct me and explain if I'm mistaken. Testing: the patch is included in our daily CI testing since a week. That is most JCK and JTREG tests, also in Xcomp mode, Renaissance benchmark and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. There was no failure I could attribute to this change. I tried to find a jtreg test that is sensitive to the keep alive by omitting it in the nmethod entry barrier and also in G1 remark but without success. 
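To make the argument easier to follow for readers outside the GC code, here is a self-contained toy model (invented names, not HotSpot sources) of why the entry barrier already covers those oops: the first call into a compiled method after a marking cycle starts is routed through the barrier, which visits the method's oop constants, so a separate stack walk at remark would only revisit the same oops.

    #include <vector>

    struct Obj { bool marked_live = false; };

    struct CompiledMethod {
      int guard = 0;                     // disarmed when equal to the current epoch
      std::vector<Obj*> oop_constants;   // oop constants embedded in the code blob
    };

    static int g_mark_epoch = 1;         // bumped when a marking cycle begins

    // Toy nmethod entry barrier: taken on the first entry after the epoch changed.
    void entry_barrier(CompiledMethod& m) {
      if (m.guard == g_mark_epoch) return;           // already processed this cycle
      for (Obj* o : m.oop_constants) {
        o->marked_live = true;                       // keep the constants alive
      }
      m.guard = g_mark_epoch;                        // disarm until the next cycle
    }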
------------- Commit messages: - G1 Remark: no need to keep alive const oops for nmethods on stack Changes: https://git.openjdk.org/jdk/pull/11314/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11314&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297487 Stats: 16 lines in 2 files changed: 1 ins; 13 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11314.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11314/head:pull/11314 PR: https://git.openjdk.org/jdk/pull/11314 From vkempik at openjdk.org Wed Nov 23 15:30:21 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Wed, 23 Nov 2022 15:30:21 GMT Subject: RFR: 8297359: RISC-V: improve performance of floating Max Min intrinsics Message-ID: Please review this change. It improves performance of Math.min/max intrinsics for Floats and Doubles. The main issue in these intrinsics is the requirement to return NaN if any of arguments is NaN. In risc-v, fmin/fmax returns NaN only if both of src registers are NaN ( quiet NaN). That requires additional logic to handle the case where only of of src is NaN. Here the postcheck with flt (floating less than comparision) and flags analysis replaced with precheck. The precheck is done with 2 fclass on both src then checking combined ( by or-in) result, if one of src is NaN then put the NaN into dst ( using fadd dst, src1, src2). Microbench results: The results on the thead c910: before Benchmark Mode Cnt Score Error Units FpMinMaxIntrinsics.dMax avgt 25 53752.831 ? 97.198 ns/op FpMinMaxIntrinsics.dMin avgt 25 53707.229 ? 177.559 ns/op FpMinMaxIntrinsics.dMinReduce avgt 25 42805.985 ? 9.901 ns/op FpMinMaxIntrinsics.fMax avgt 25 53449.568 ? 215.294 ns/op FpMinMaxIntrinsics.fMin avgt 25 53504.106 ? 180.833 ns/op FpMinMaxIntrinsics.fMinReduce avgt 25 42794.579 ? 7.013 ns/op MaxMinOptimizeTest.dAdd avgt 25 381.138 ? 5.692 us/op MaxMinOptimizeTest.dMax avgt 25 4575.094 ? 17.065 us/op MaxMinOptimizeTest.dMin avgt 25 4584.648 ? 18.561 us/op MaxMinOptimizeTest.dMul avgt 25 384.615 ? 7.751 us/op MaxMinOptimizeTest.fAdd avgt 25 318.076 ? 3.308 us/op MaxMinOptimizeTest.fMax avgt 25 4405.724 ? 20.353 us/op MaxMinOptimizeTest.fMin avgt 25 4421.652 ? 18.029 us/op MaxMinOptimizeTest.fMul avgt 25 305.462 ? 19.437 us/op after Benchmark Mode Cnt Score Error Units FpMinMaxIntrinsics.dMax avgt 25 10712.246 ? 5.607 ns/op FpMinMaxIntrinsics.dMin avgt 25 10732.655 ? 41.894 ns/op FpMinMaxIntrinsics.dMinReduce avgt 25 3248.106 ? 2.143 ns/op FpMinMaxIntrinsics.fMax avgt 25 10707.084 ? 3.276 ns/op FpMinMaxIntrinsics.fMin avgt 25 10719.771 ? 14.864 ns/op FpMinMaxIntrinsics.fMinReduce avgt 25 3274.775 ? 0.996 ns/op MaxMinOptimizeTest.dAdd avgt 25 383.720 ? 8.849 us/op MaxMinOptimizeTest.dMax avgt 25 429.345 ? 11.160 us/op MaxMinOptimizeTest.dMin avgt 25 439.980 ? 3.757 us/op MaxMinOptimizeTest.dMul avgt 25 390.126 ? 10.258 us/op MaxMinOptimizeTest.fAdd avgt 25 300.005 ? 18.206 us/op MaxMinOptimizeTest.fMax avgt 25 370.467 ? 6.054 us/op MaxMinOptimizeTest.fMin avgt 25 375.134 ? 4.568 us/op MaxMinOptimizeTest.fMul avgt 25 305.344 ? 18.307 us/op hifive umatched before Benchmark Mode Cnt Score Error Units FpMinMaxIntrinsics.dMax avgt 25 30234.224 ? 16.744 ns/op FpMinMaxIntrinsics.dMin avgt 25 30227.686 ? 15.389 ns/op FpMinMaxIntrinsics.dMinReduce avgt 25 15766.749 ? 3.724 ns/op FpMinMaxIntrinsics.fMax avgt 25 30140.092 ? 10.243 ns/op FpMinMaxIntrinsics.fMin avgt 25 30149.470 ? 34.041 ns/op FpMinMaxIntrinsics.fMinReduce avgt 25 15760.770 ? 5.415 ns/op MaxMinOptimizeTest.dAdd avgt 25 1155.234 ? 
4.603 us/op MaxMinOptimizeTest.dMax avgt 25 2597.897 ? 3.307 us/op MaxMinOptimizeTest.dMin avgt 25 2599.183 ? 3.806 us/op MaxMinOptimizeTest.dMul avgt 25 1155.281 ? 1.813 us/op MaxMinOptimizeTest.fAdd avgt 25 750.967 ? 7.254 us/op MaxMinOptimizeTest.fMax avgt 25 2305.085 ? 1.556 us/op MaxMinOptimizeTest.fMin avgt 25 2305.306 ? 1.478 us/op MaxMinOptimizeTest.fMul avgt 25 750.623 ? 7.357 us/op 2fclass_new Benchmark Mode Cnt Score Error Units FpMinMaxIntrinsics.dMax avgt 25 23599.547 ? 29.571 ns/op FpMinMaxIntrinsics.dMin avgt 25 23593.236 ? 18.456 ns/op FpMinMaxIntrinsics.dMinReduce avgt 25 8630.201 ? 1.353 ns/op FpMinMaxIntrinsics.fMax avgt 25 23496.337 ? 18.340 ns/op FpMinMaxIntrinsics.fMin avgt 25 23477.881 ? 8.545 ns/op FpMinMaxIntrinsics.fMinReduce avgt 25 8629.135 ? 0.869 ns/op MaxMinOptimizeTest.dAdd avgt 25 1155.479 ? 4.938 us/op MaxMinOptimizeTest.dMax avgt 25 1560.323 ? 3.077 us/op MaxMinOptimizeTest.dMin avgt 25 1558.668 ? 2.421 us/op MaxMinOptimizeTest.dMul avgt 25 1154.919 ? 2.077 us/op MaxMinOptimizeTest.fAdd avgt 25 751.325 ? 7.169 us/op MaxMinOptimizeTest.fMax avgt 25 1306.131 ? 1.102 us/op MaxMinOptimizeTest.fMin avgt 25 1306.134 ? 0.957 us/op MaxMinOptimizeTest.fMul avgt 25 750.968 ? 7.334 us/op ------------- Commit messages: - updated version of 2fclass minmax Changes: https://git.openjdk.org/jdk/pull/11327/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11327&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297359 Stats: 33 lines in 2 files changed: 12 ins; 11 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/11327.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11327/head:pull/11327 PR: https://git.openjdk.org/jdk/pull/11327 From never at openjdk.org Wed Nov 23 17:01:29 2022 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 23 Nov 2022 17:01:29 GMT Subject: RFR: 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception [v2] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 10:54:31 GMT, Doug Simon wrote: >> JVMCI has a mechanism for translating exceptions from libjvmci to HotSpot and vice versa. This is important for proper error handling when a thread calls between these 2 runtime heaps. >> >> This translation mechanism itself needs to be robust in the context of resource limits, especially heap limits, as it may be translating an OutOfMemoryError from HotSpot back into libjvmci. The existing code in [`HotSpotJVMCIRuntime.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java#L237) and [`TranslatedException.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/TranslatedException.java#L153) is designed to handle translation failures by falling back to non-allocating code. However, we still occasionally see [an OOME that breaks the translation mechanism](https://github.com/oracle/graal/issues/5470#issuecomment-1321749688). One speculated possibility for this is an OOME re-materializing oops d uring a deoptimization causing an unexpected execution path. This PR increases the robustness of the exception translation code in light of such issues. 
> > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > add more context when possible to exception translation errors Marked as reviewed by never (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11286 From never at openjdk.org Wed Nov 23 17:01:30 2022 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 23 Nov 2022 17:01:30 GMT Subject: RFR: 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception [v2] In-Reply-To: References: <2NRXrCm_YYPDtujf239wUNPW1iKv9ZyVDVkXmlLK0dA=.64e06762-03e7-43a5-b57f-77a24a99d042@github.com> Message-ID: On Wed, 23 Nov 2022 10:55:13 GMT, Doug Simon wrote: >> src/hotspot/share/jvmci/jvmciEnv.cpp line 321: >> >>> 319: jlong buffer = (jlong) NEW_RESOURCE_ARRAY_IN_THREAD_RETURN_NULL(THREAD, jbyte, buffer_size); >>> 320: if (buffer == 0L) { >>> 321: decode(THREAD, runtimeKlass, 0L); >> >> Can we add an argument so that each of these call sites reports a unique message? Should we get the class name from the pending exception and include that as well? I think we should include enough breadcrumbs for future failures in the this path that we might have a better guess what's happening. > > I think the case we care about most is an `OutOfMemoryError` occurring in the HotSpot heap so I've pushed a change that calls this out. I've also distinguished the case where the native buffer for the encoding cannot be allocated (although if that happens, the VM is in real trouble). I was thinking we could get the jthrowable and somehow find the name of the exception from the jclass but I guess JNI doesn't let you do that. Should we use `_from_env->describe_pending_exception()`? ------------- PR: https://git.openjdk.org/jdk/pull/11286 From dnsimon at openjdk.org Wed Nov 23 17:11:23 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 23 Nov 2022 17:11:23 GMT Subject: RFR: 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception [v2] In-Reply-To: References: <2NRXrCm_YYPDtujf239wUNPW1iKv9ZyVDVkXmlLK0dA=.64e06762-03e7-43a5-b57f-77a24a99d042@github.com> Message-ID: On Wed, 23 Nov 2022 16:57:29 GMT, Tom Rodriguez wrote: >> I think the case we care about most is an `OutOfMemoryError` occurring in the HotSpot heap so I've pushed a change that calls this out. I've also distinguished the case where the native buffer for the encoding cannot be allocated (although if that happens, the VM is in real trouble). > > I was thinking we could get the jthrowable and somehow find the name of the exception from the jclass but I guess JNI doesn't let you do that. Should we use `_from_env->describe_pending_exception()`? Good idea but unfortunately the [JNI `ExceptionDescribe` function](https://docs.oracle.com/en/java/javase/19/docs/specs/jni/functions.html#exceptiondescribe) writes to the console which is something we don't want in this context. ------------- PR: https://git.openjdk.org/jdk/pull/11286 From never at openjdk.org Wed Nov 23 17:27:26 2022 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 23 Nov 2022 17:27:26 GMT Subject: RFR: 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception [v2] In-Reply-To: References: <2NRXrCm_YYPDtujf239wUNPW1iKv9ZyVDVkXmlLK0dA=.64e06762-03e7-43a5-b57f-77a24a99d042@github.com> Message-ID: On Wed, 23 Nov 2022 17:09:21 GMT, Doug Simon wrote: >> I was thinking we could get the jthrowable and somehow find the name of the exception from the jclass but I guess JNI doesn't let you do that. 
Should we use `_from_env->describe_pending_exception()`? > > Good idea but unfortunately the [JNI `ExceptionDescribe` function](https://docs.oracle.com/en/java/javase/19/docs/specs/jni/functions.html#exceptiondescribe) writes to the console which is something we don't want in this context. If we were still treating it as fatal it would be fine but I agree it's not ideal in the new code. ------------- PR: https://git.openjdk.org/jdk/pull/11286 From lmesnik at openjdk.org Wed Nov 23 17:53:34 2022 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Wed, 23 Nov 2022 17:53:34 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. Daugherty wrote: > Misc stress testing related fixes: > > [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest > [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode > [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode test/jdk/jdk/internal/vm/Continuation/Fuzz.java line 90: > 88: > 89: public static void main(String[] args) { > 90: if (Platform.isSlowDebugBuild() && Platform.isOSX() && Platform.isAArch64()) { I don't like the idea of skipping the unstable test using SkippedException. Wouldn't be better to add problemlist for slowdebug? So anyone could easy identify test bugs in slowdebug mode. Really it would be better to support bits configurations in standard problem lists like os/arch but it is a separate issue. ------------- PR: https://git.openjdk.org/jdk/pull/11278 From sviswanathan at openjdk.org Wed Nov 23 18:03:20 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 23 Nov 2022 18:03:20 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception In-Reply-To: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 02:59:30 GMT, Volodymyr Paprotski wrote: > From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): > > > __ mov(t0, a0); > __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 562: > 560: const Register t1 = r14; > 561: const Register t2 = r15; > 562: const Register rscratch = r14; The register map above in the comments also should reflect this change that rscratch is r14 now. ------------- PR: https://git.openjdk.org/jdk/pull/11308 From vlivanov at openjdk.org Wed Nov 23 18:43:31 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 23 Nov 2022 18:43:31 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception In-Reply-To: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 02:59:30 GMT, Volodymyr Paprotski wrote: > From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. 
Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): > > > __ mov(t0, a0); > __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) It's a general problem: it's hard to see possible interference when names alias to the same register. I suggest to get rid of `rscratch` declaration and directly refer to `t1` followed by `/*rscratch*/` comment: __ movq(t0, a0); __ andq(t0, ExternalAddress(poly1305_mask44()), t1 /*rscratch*/); // First limb (Acc[43:0]) __ movq(C0, t0); Also, `-XX:+ForceUnreachable` is designed to stress non-rip addressing mode. Worth adding it to the corresponding test. ------------- PR: https://git.openjdk.org/jdk/pull/11308 From duke at openjdk.org Wed Nov 23 19:40:26 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 23 Nov 2022 19:40:26 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 18:41:18 GMT, Vladimir Ivanov wrote: > Also, `-XX:+ForceUnreachable` is designed to stress non-rip addressing mode. Worth adding it to the corresponding test. Thanks, was wondering about that. > It's a general problem: it's hard to see possible interference when names alias to the same register. I suggest to get rid of `rscratch` declaration and directly refer to `t1` followed by `/*rscratch*/` comment: Hmm. I think register aliasing is pretty much a 'necessary evil' in the intrinsic code. Having better names for variables combined with too few registers leads to this issue. (Unless of course we could somehow inject intrinsics _before_ register allocation. Writing assembler with infinite registers.. there is an idea :) ). But in this specific case maybe getting rid of aliasing doesn't harm the readability that much. In some way, I would have preferred to keep `rscratch` instead of `t1` (I already have too many 'temps'), but the `/*rscratch*/` comment should make things ok.
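(For anyone skimming, a stand-alone toy illustration, not the stub code, of the clobber that started this thread: when the scratch register used to materialize a far address/constant aliases a register that still holds a live value, that value is silently lost.)

    #include <cstdint>

    // Toy register file; index 13 plays the role of r13 here.
    static uint64_t reg[16];

    // Models andq(dst, ExternalAddress(mask), rscratch): the constant is first
    // materialized into rscratch, then combined into dst.
    void andq_ext(int dst, uint64_t mask, int rscratch) {
      reg[rscratch] = mask;        // clobbers whatever rscratch currently holds
      reg[dst] &= reg[rscratch];
    }

    int main() {
      const int a0 = 12, t0 = 13, rscratch = 13;   // t0 and rscratch alias: the bug
      reg[a0] = 0x123456789abcdef0ULL;
      reg[t0] = reg[a0];                           // __ mov(t0, a0)
      andq_ext(t0, 0xfffffffffffULL, rscratch);    // __ andq(t0, ExternalAddress(mask), rscratch)
      // reg[t0] now holds just the mask, not (a0 & mask): the copy of a0 was lost.
      return reg[t0] == 0xfffffffffffULL ? 0 : 1;
    }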
------------- PR: https://git.openjdk.org/jdk/pull/11308 From svkamath at openjdk.org Wed Nov 23 19:58:41 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 23 Nov 2022 19:58:41 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" [v2] In-Reply-To: References: Message-ID: <6gcjBNNHk8twSr92oF2TGB90s6F4WnFZWT4xJPmuYoc=.c9c47aad-82a6-44a3-bc50-68e2b9e0a7c6@github.com> > 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11301/files - new: https://git.openjdk.org/jdk/pull/11301/files/bda63544..5af25e9b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11301&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11301&range=00-01 Stats: 11 lines in 1 file changed: 0 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/11301.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11301/head:pull/11301 PR: https://git.openjdk.org/jdk/pull/11301 From duke at openjdk.org Wed Nov 23 20:01:37 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 23 Nov 2022 20:01:37 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v2] In-Reply-To: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: > From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): > > > __ mov(t0, a0); > __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: remove rscratch from top regmap and ForceUnreachable test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11308/files - new: https://git.openjdk.org/jdk/pull/11308/files/aa8cea41..a2c8907f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11308&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11308&range=00-01 Stats: 24 lines in 2 files changed: 0 ins; 2 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/11308.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11308/head:pull/11308 PR: https://git.openjdk.org/jdk/pull/11308 From duke at openjdk.org Wed Nov 23 20:04:52 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Wed, 23 Nov 2022 20:04:52 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes [v4] In-Reply-To: References: Message-ID: <6Oxc0wRLUE-gil2tIE2IBQfykELzLNJ7Ki77xpB2yZ8=.ee1a319e-a2be-4495-b821-6b587b0200cb@github.com> > I noticed some idealizations have no associated IR tests so I included for them. Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: restrict RotateLeft/Right tests from running on x86. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/11049/files - new: https://git.openjdk.org/jdk/pull/11049/files/bb92a752..02c2d5e9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11049&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11049&range=02-03 Stats: 2 lines in 2 files changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11049.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11049/head:pull/11049 PR: https://git.openjdk.org/jdk/pull/11049 From duke at openjdk.org Wed Nov 23 20:05:22 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 23 Nov 2022 20:05:22 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 18:41:18 GMT, Vladimir Ivanov wrote: >> From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): >> >> >> __ mov(t0, a0); >> __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) > > Also, `-XX:+ForceUnreachable` is designed to stress non-rip addressing mode. Worth adding it to the corresponding test. @iwanowww done - Confirmed, `-XX:+ForceUnreachable` does indeed catch this on linux; added to half the tests - Only removed `rscratch` from top-level `poly1305_process_blocks_avx512` function. I think calling it `rscratch` in 'leaf' functions (i.e. `poly1305_multiply8_avx512`) is ok. ------------- PR: https://git.openjdk.org/jdk/pull/11308 From duke at openjdk.org Wed Nov 23 20:05:23 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 23 Nov 2022 20:05:23 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v2] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 18:00:58 GMT, Sandhya Viswanathan wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> remove rscratch from top regmap and ForceUnreachable test > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 562: > >> 560: const Register t1 = r14; >> 561: const Register t2 = r15; >> 562: const Register rscratch = r14; > > The register map above in the comments also should reflect this change that rscratch is r14 now. done. (or rather, ended up removing `rscratch` completely from that function) ------------- PR: https://git.openjdk.org/jdk/pull/11308 From duke at openjdk.org Wed Nov 23 20:13:18 2022 From: duke at openjdk.org (Zhiqiang Zang) Date: Wed, 23 Nov 2022 20:13:18 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes [v3] In-Reply-To: References: Message-ID: <8xK2P815j62-TYJQ0VmEe4VOXZkFcjP8JzdIjUf5wb0=.a7e79ed6-3460-4d8b-ae0e-a163dcaabe9f@github.com> On Wed, 23 Nov 2022 09:04:09 GMT, Christian Hagedorn wrote: > We should make sure to make them pass on x86 as well such that GHA is clean after the integration of this PR. I've had a quick look on the match rules and it seems that we only use `RotateLeft/Right` nodes on x86_64, aarch64 and riscv. We could restrict these tests to only run on these architectures with `@requires os.arch == "x86_64" | os.arch == "aarch64" | os.arch == "riscv64"`. 
Thanks for your help! I have added the restrictions and the GHA should be clean now. I have a question though: where did you see the architectures where RotateLeft/Right nodes are used, so that I can know when the same case happens again? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/11049 From mdoerr at openjdk.org Wed Nov 23 20:56:22 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 23 Nov 2022 20:56:22 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v11] In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 20:09:42 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > We can't retrieve class loader oop during class unloading. I think treating method handle intrinsics as adapter blobs has other drawbacks. At least, we would need to think about how much space to reserve in non-nmethod space. What if that one gets full? I think https://bugs.openjdk.org/browse/JDK-8296336 is the best approach in the long term. For now, this PR has worked stably for 2 weeks. By the way, I was able to reproduce the issue with `jdk/bin/java -XX:ReservedCodeCacheSize=2496k -XX:+SegmentedCodeCache -version`. But I'll take a look at the whitebox API and see if I can write a more deterministic test. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From sviswanathan at openjdk.org Wed Nov 23 21:45:36 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 23 Nov 2022 21:45:36 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v2] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 20:01:37 GMT, Volodymyr Paprotski wrote: >> From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used.
Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): >> >> >> __ mov(t0, a0); >> __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > remove rscratch from top regmap and ForceUnreachable test test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/Poly1305UnitTestDriver.java line 52: > 50: * @modules java.base/com.sun.crypto.provider > 51: * @summary Unit test for IntrinsicCandidate in com.sun.crypto.provider.Poly1305. > 52: * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:+ForceUnreachable java.base/com.sun.crypto.provider.Poly1305IntrinsicFuzzTest Don't you need `-XX:+UnlockDiagnosticVMOptions` here as well? ------------- PR: https://git.openjdk.org/jdk/pull/11308 From duke at openjdk.org Wed Nov 23 22:32:45 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 23 Nov 2022 22:32:45 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v2] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 22:26:59 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> remove rscratch from top regmap and ForceUnreachable test > > test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/Poly1305UnitTestDriver.java line 52: > >> 50: * @modules java.base/com.sun.crypto.provider >> 51: * @summary Unit test for IntrinsicCandidate in com.sun.crypto.provider.Poly1305. >> 52: * @run main/othervm -Xcomp -XX:-TieredCompilation -XX:+ForceUnreachable java.base/com.sun.crypto.provider.Poly1305IntrinsicFuzzTest > > Don't you need `-XX:+UnlockDiagnosticVMOptions` here as well? Darn. Yes.. one min, will push.. Sorry about the noise. (thought that the fact that it failed on linux meant it wasnt a diagnostic, but it is..) product(bool, ForceUnreachable, false, DIAGNOSTIC, \ "Make all non code cache addresses to be unreachable by " \ "forcing use of 64bit literal fixups") ------------- PR: https://git.openjdk.org/jdk/pull/11308 From vlivanov at openjdk.org Wed Nov 23 22:35:18 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 23 Nov 2022 22:35:18 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v2] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 20:01:37 GMT, Volodymyr Paprotski wrote: >> From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): >> >> >> __ mov(t0, a0); >> __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > remove rscratch from top regmap and ForceUnreachable test src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 562: > 560: const Register t1 = r14; > 561: const Register t2 = r15; > 562: const Register rscratch = r13; After seeing the whole patch, I noticed that `t1` is used only at the very end of the stub. 
Alternatively, you could keep `rscratch` and move `t1` declaration close to the use sites. ------------- PR: https://git.openjdk.org/jdk/pull/11308 From sviswanathan at openjdk.org Wed Nov 23 22:36:22 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 23 Nov 2022 22:36:22 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" [v2] In-Reply-To: <6gcjBNNHk8twSr92oF2TGB90s6F4WnFZWT4xJPmuYoc=.c9c47aad-82a6-44a3-bc50-68e2b9e0a7c6@github.com> References: <6gcjBNNHk8twSr92oF2TGB90s6F4WnFZWT4xJPmuYoc=.c9c47aad-82a6-44a3-bc50-68e2b9e0a7c6@github.com> Message-ID: On Wed, 23 Nov 2022 19:58:41 GMT, Smita Kamath wrote: >> 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments Marked as reviewed by sviswanathan (Reviewer). The PR looks good to me. ------------- PR: https://git.openjdk.org/jdk/pull/11301 From svkamath at openjdk.org Wed Nov 23 22:36:23 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 23 Nov 2022 22:36:23 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" [v2] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 05:07:00 GMT, David Holmes wrote: >> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments > > src/hotspot/share/runtime/sharedRuntime.cpp line 455: > >> 453: union {jfloat f; juint i;} bits; >> 454: bits.f = x; >> 455: jint doppel = bits.i; > > Doesn't the conversion from unsigned to signed risk a compiler warning being emitted? > > Can't you just use the existing `JavaValue` type to perform the union conversion trick? Hi David, thanks for pointing this out. I have updated the code to use jint. I have used the union conversion trick that was previously used in SharedRuntime::drem and SharedRuntime::frem. I hope that's okay with you. ------------- PR: https://git.openjdk.org/jdk/pull/11301 From duke at openjdk.org Wed Nov 23 22:43:22 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 23 Nov 2022 22:43:22 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v2] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 22:33:05 GMT, Vladimir Ivanov wrote: >> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: >> >> remove rscratch from top regmap and ForceUnreachable test > > src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 562: > >> 560: const Register t1 = r14; >> 561: const Register t2 = r15; >> 562: const Register rscratch = r13; > > After seeing the whole patch, I noticed that `t1` is used only at the very end of the stub. > > Alternatively, you could keep `rscratch` and move `t1` declaration close to the use sites. 
No (unless I misunderstood), there are a couple more in the middle passed to `poly1305_multiply_scalar` (I prefer 'the look' of `t0, t1, t2`) poly1305_multiply_scalar(a0, a1, a2, r0, r1, c1, true, t0, t1, t2, mulql, mulqh); ------------- PR: https://git.openjdk.org/jdk/pull/11308 From duke at openjdk.org Wed Nov 23 22:50:34 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Wed, 23 Nov 2022 22:50:34 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v3] In-Reply-To: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: > From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): > > > __ mov(t0, a0); > __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: add UnlockDiagnosticVMOptions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11308/files - new: https://git.openjdk.org/jdk/pull/11308/files/a2c8907f..0effe196 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11308&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11308&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11308.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11308/head:pull/11308 PR: https://git.openjdk.org/jdk/pull/11308 From vlivanov at openjdk.org Wed Nov 23 23:07:19 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 23 Nov 2022 23:07:19 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v3] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 22:50:34 GMT, Volodymyr Paprotski wrote: >> From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): >> >> >> __ mov(t0, a0); >> __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > add UnlockDiagnosticVMOptions Looks good. Submitted the patch for testing. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11308 From vlivanov at openjdk.org Wed Nov 23 23:07:20 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 23 Nov 2022 23:07:20 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v3] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: <_BGzXZlblX63DvgoXtUcY5P8dfdt28eA7WIg_mR7tRU=.833044db-5193-4082-88fe-a7f0b58c3bca@github.com> On Wed, 23 Nov 2022 22:41:01 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 562: >> >>> 560: const Register t1 = r14; >>> 561: const Register t2 = r15; >>> 562: const Register rscratch = r13; >> >> After seeing the whole patch, I noticed that `t1` is used only at the very end of the stub. >> >> Alternatively, you could keep `rscratch` and move `t1` declaration close to the use sites. > > No (unless I misunderstood), there are a couple more in the middle passed to `poly1305_multiply_scalar` (I prefer 'the look' of `t0, t1, t2`) > > > poly1305_multiply_scalar(a0, a1, a2, > r0, r1, c1, true, > t0, t1, t2, mulql, mulqh); Yes, you are right. Overlooked those when browsed code on GitHub. ------------- PR: https://git.openjdk.org/jdk/pull/11308 From dholmes at openjdk.org Thu Nov 24 00:00:22 2022 From: dholmes at openjdk.org (David Holmes) Date: Thu, 24 Nov 2022 00:00:22 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" [v2] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 22:32:35 GMT, Smita Kamath wrote: >> src/hotspot/share/runtime/sharedRuntime.cpp line 455: >> >>> 453: union {jfloat f; juint i;} bits; >>> 454: bits.f = x; >>> 455: jint doppel = bits.i; >> >> Doesn't the conversion from unsigned to signed risk a compiler warning being emitted? >> >> Can't you just use the existing `JavaValue` type to perform the union conversion trick? > > Hi David, thanks for pointing this out. I have updated the code to use jint. > I have used the union conversion trick that was previously used in SharedRuntime::drem and SharedRuntime::frem. I hope that's okay with you. We seem to employ this trick in a few places, e.g. also see metaprogramming/primitiveConversions.hpp. It would be good to reduce that so I will file a separate RFE. ------------- PR: https://git.openjdk.org/jdk/pull/11301 From dholmes at openjdk.org Thu Nov 24 00:03:19 2022 From: dholmes at openjdk.org (David Holmes) Date: Thu, 24 Nov 2022 00:03:19 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" [v2] In-Reply-To: <6gcjBNNHk8twSr92oF2TGB90s6F4WnFZWT4xJPmuYoc=.c9c47aad-82a6-44a3-bc50-68e2b9e0a7c6@github.com> References: <6gcjBNNHk8twSr92oF2TGB90s6F4WnFZWT4xJPmuYoc=.c9c47aad-82a6-44a3-bc50-68e2b9e0a7c6@github.com> Message-ID: On Wed, 23 Nov 2022 19:58:41 GMT, Smita Kamath wrote: >> 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments I'm not familiar with the details of this code, so will put this through our CI to verify the test failure is dealt with. 
------------- PR: https://git.openjdk.org/jdk/pull/11301 From dholmes at openjdk.org Thu Nov 24 03:35:22 2022 From: dholmes at openjdk.org (David Holmes) Date: Thu, 24 Nov 2022 03:35:22 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" [v2] In-Reply-To: <6gcjBNNHk8twSr92oF2TGB90s6F4WnFZWT4xJPmuYoc=.c9c47aad-82a6-44a3-bc50-68e2b9e0a7c6@github.com> References: <6gcjBNNHk8twSr92oF2TGB90s6F4WnFZWT4xJPmuYoc=.c9c47aad-82a6-44a3-bc50-68e2b9e0a7c6@github.com> Message-ID: On Wed, 23 Nov 2022 19:58:41 GMT, Smita Kamath wrote: >> 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments Our CI testing passed. ------------- PR: https://git.openjdk.org/jdk/pull/11301 From dzhang at openjdk.org Thu Nov 24 05:48:31 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 24 Nov 2022 05:48:31 GMT Subject: RFR: 8297549: RISC-V: Support vloadcon instruction for Vector API Message-ID: The instruction which is matched `VectorLoadConst` will create index starting from 0 and incremented by 1. In detail, the instruction populates the destination vector by setting the first element to 0 and monotonically incrementing the value by 1 for each subsequent element. We can add support of `VectorLoadConst` for RISC-V by `vid.v` . It was implemented by referring to RVV v1.0 [1]. The JMH score of the micro-benchmark `IndexVectorBenchmark` [2] will be improved by a factor of about 9 on avarage after we implement this node. Before: Benchmark (size) Mode Cnt Score Error Units IndexVectorBenchmark.byteIndexVector 1024 thrpt 5 16.900 ? 3.606 ops/ms IndexVectorBenchmark.doubleIndexVector 1024 thrpt 5 2.722 ? 0.113 ops/ms IndexVectorBenchmark.floatIndexVector 1024 thrpt 5 3.692 ? 0.588 ops/ms IndexVectorBenchmark.intIndexVector 1024 thrpt 5 4.974 ? 0.943 ops/ms IndexVectorBenchmark.longIndexVector 1024 thrpt 5 3.273 ? 0.251 ops/ms IndexVectorBenchmark.shortIndexVector 1024 thrpt 5 9.485 ? 0.452 ops/ms After: Benchmark (size) Mode Cnt Score Error Units IndexVectorBenchmark.byteIndexVector 1024 thrpt 5 91.309 ? 0.833 ops/ms IndexVectorBenchmark.doubleIndexVector 1024 thrpt 5 20.834 ? 5.665 ops/ms IndexVectorBenchmark.floatIndexVector 1024 thrpt 5 33.560 ? 1.569 ops/ms IndexVectorBenchmark.intIndexVector 1024 thrpt 5 60.216 ? 0.532 ops/ms IndexVectorBenchmark.longIndexVector 1024 thrpt 5 42.142 ? 1.934 ops/ms IndexVectorBenchmark.shortIndexVector 1024 thrpt 5 76.982 ? 0.612 ops/ms [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java Please take a look and have some reviews. Thanks a lot. 
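(A scalar sketch of the semantics for readers unfamiliar with `vid.v`, toy code with a placeholder element type rather than the match rule itself: the destination simply receives 0, 1, 2, ... up to the vector length.)

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // What VectorLoadConst / vid.v produces: an index vector 0 .. vl-1.
    std::vector<int32_t> vector_load_const(std::size_t vl) {
      std::vector<int32_t> v(vl);
      for (std::size_t i = 0; i < vl; ++i) {
        v[i] = static_cast<int32_t>(i);   // each lane gets its own element index
      }
      return v;
    }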
## Testing: - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) ------------- Commit messages: - Support vloadcon instruction for Vector API Changes: https://git.openjdk.org/jdk/pull/11344/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11344&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297549 Stats: 18 lines in 1 file changed: 17 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11344.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11344/head:pull/11344 PR: https://git.openjdk.org/jdk/pull/11344 From yzhu at openjdk.org Thu Nov 24 07:59:05 2022 From: yzhu at openjdk.org (Yanhong Zhu) Date: Thu, 24 Nov 2022 07:59:05 GMT Subject: RFR: 8297476: Increase InlineSmallCode default from 1000 to 2500 for RISC-V [v2] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 10:01:43 GMT, Fei Yang wrote: >> The current default value of InlineSmallCode on RISC-V is 1000. I witnessed notable performance improvement by increasing this value to 2000 when running the Renaissance benchmark. Here are the exact commands used for each of the benchmarks: >> >> Before: >> $ java -XX:InlineSmallCode=1000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all >> >> After: >> $ java -XX:InlineSmallCode=2000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all >> >> Best run time for one repetition (ms ? lower is better) on Unmatched board: >> >> Benchmark | Before | After | Ratio >> -- | -- | -- | -- >> AkkaUct | 75629.766 | 71839.905 | 5.01% >> Reactors | 98120.668 | 91597.120 | 6.65% >> DecTree | 12144.666 | 11801.740 | 2.82% >> Als | 57719.166 | 53307.041 | 7.64% >> ChiSquare | 21704.666 | 16301.189 | 24.89% >> GaussMix | 17494.891 | 17497.291 | -0.02% >> LogRegression | 11881.352 | 11382.722 | 4.20% >> MovieLens | 100944.374 | 96510.793 | 4.39% >> NaiveBayes | 81946.569 | 68566.763 | 16.32% >> PageRank | 43689.497 | 43204.553 | 1.11% >> FjKmeans | 68398.667 | 67261.674 | 1.66% >> FutureGenetic | 31752.695 | 31524.457 | 0.72% >> Mnemonics | 126312.832 | 115335.512 | 8.69% >> ParMnemonics | 93406.666 | 88320.443 | 5.45% >> Scrabble | 6894.853 | 6888.426 | 0.09% >> RxScrabble | 5163.473 | 4875.730 | 5.08% >> Dotty | 14852.405 | 14667.255 | 1.25% >> ScalaDoku | 95770.117 | 39728.637 | 58.52% >> Philosophers | 13974.965 | 11579.551 | 17.14% >> ScalaStmBench7 | 12185.093 | 12243.016 | -0.47% >> FinagleChirper | 32676.065 | 30900.282 | 5.44% >> FinagleHttp | 30633.640 | 30191.792 | 1.44% >> >> Other testing: tier1-tier3 tested on Unmatched board. >> >> I have also tested other possible values for InlineSmallCode like 1500 and 2500, but the numbers say show 2000 would outperform those values in most of the cases. And I can verify no regressions across at least following benchmarks: >> >> - Dacapo >> - SPECjvm2008 >> - SPECjbb2005 >> - SPECjbb2015 > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Review Looks good. ------------- Marked as reviewed by yzhu (Author). 
PR: https://git.openjdk.org/jdk/pull/11310 From rcastanedalo at openjdk.org Thu Nov 24 08:24:22 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 24 Nov 2022 08:24:22 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v3] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 22:50:34 GMT, Volodymyr Paprotski wrote: >> From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): >> >> >> __ mov(t0, a0); >> __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > add UnlockDiagnosticVMOptions Thanks for promptly addressing this issue, Volodymyr, looks good. I see that @iwanowww's internal testing succeeded. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR: https://git.openjdk.org/jdk/pull/11308 From thartmann at openjdk.org Thu Nov 24 09:47:22 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 24 Nov 2022 09:47:22 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v3] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: <5TJktR1jm_VBQ3g0I7RhoIBTnff8ZfUe-7N4a7sgKUI=.32c58f51-dc53-41ce-9f6b-8f78f4279fe5@github.com> On Wed, 23 Nov 2022 22:50:34 GMT, Volodymyr Paprotski wrote: >> From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): >> >> >> __ mov(t0, a0); >> __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > add UnlockDiagnosticVMOptions Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11308 From fyang at openjdk.org Thu Nov 24 13:34:19 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 24 Nov 2022 13:34:19 GMT Subject: RFR: 8297476: Increase InlineSmallCode default from 1000 to 2500 for RISC-V [v2] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 10:02:30 GMT, Aleksey Shipilev wrote: >> Fei Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> Review > > I like this, thanks. @shipilev @yhzhu20 : Thanks for the review! I have also verified on both linux-x64 and linux-aarch64 platforms with following commands, still work as expected. 
$ java -XX:+PrintFlagsFinal -version | grep InlineSmallCode intx InlineSmallCode = 2500 {C2 pd product} {default} openjdk version "20-internal" 2023-03-21 OpenJDK Runtime Environment (build 20-internal-adhoc.fyang.openjdk-jdk) OpenJDK 64-Bit Server VM (build 20-internal-adhoc.fyang.openjdk-jdk, mixed mode, sharing) $ java -XX:-TieredCompilation -XX:+PrintFlagsFinal -version | grep InlineSmallCode intx InlineSmallCode = 1000 {C2 pd product} {default} openjdk version "20-internal" 2023-03-21 OpenJDK Runtime Environment (build 20-internal-adhoc.fyang.openjdk-jdk) OpenJDK 64-Bit Server VM (build 20-internal-adhoc.fyang.openjdk-jdk, mixed mode, sharing) ------------- PR: https://git.openjdk.org/jdk/pull/11310 From fyang at openjdk.org Thu Nov 24 13:36:35 2022 From: fyang at openjdk.org (Fei Yang) Date: Thu, 24 Nov 2022 13:36:35 GMT Subject: Integrated: 8297476: Increase InlineSmallCode default from 1000 to 2500 for RISC-V In-Reply-To: References: Message-ID: <-EEHZX8-3fIK-diLbBQMSvrKs2r5wUROQGDd3bT6ng4=.8c21fa8c-e745-4e00-8ea5-a5b1cd2a7064@github.com> On Wed, 23 Nov 2022 06:18:44 GMT, Fei Yang wrote: > The current default value of InlineSmallCode on RISC-V is 1000. I witnessed notable performance improvement by increasing this value to 2000 when running the Renaissance benchmark. Here are the exact commands used for each of the benchmarks: > > Before: > $ java -XX:InlineSmallCode=1000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all > > After: > $ java -XX:InlineSmallCode=2000 -XX:+UseParallelGC -Xms12g -Xmx12g -jar renaissance-gpl-0.14.1.jar -r 40 all > > Best run time for one repetition (ms ? lower is better) on Unmatched board: > > Benchmark | Before | After | Ratio > -- | -- | -- | -- > AkkaUct | 75629.766 | 71839.905 | 5.01% > Reactors | 98120.668 | 91597.120 | 6.65% > DecTree | 12144.666 | 11801.740 | 2.82% > Als | 57719.166 | 53307.041 | 7.64% > ChiSquare | 21704.666 | 16301.189 | 24.89% > GaussMix | 17494.891 | 17497.291 | -0.02% > LogRegression | 11881.352 | 11382.722 | 4.20% > MovieLens | 100944.374 | 96510.793 | 4.39% > NaiveBayes | 81946.569 | 68566.763 | 16.32% > PageRank | 43689.497 | 43204.553 | 1.11% > FjKmeans | 68398.667 | 67261.674 | 1.66% > FutureGenetic | 31752.695 | 31524.457 | 0.72% > Mnemonics | 126312.832 | 115335.512 | 8.69% > ParMnemonics | 93406.666 | 88320.443 | 5.45% > Scrabble | 6894.853 | 6888.426 | 0.09% > RxScrabble | 5163.473 | 4875.730 | 5.08% > Dotty | 14852.405 | 14667.255 | 1.25% > ScalaDoku | 95770.117 | 39728.637 | 58.52% > Philosophers | 13974.965 | 11579.551 | 17.14% > ScalaStmBench7 | 12185.093 | 12243.016 | -0.47% > FinagleChirper | 32676.065 | 30900.282 | 5.44% > FinagleHttp | 30633.640 | 30191.792 | 1.44% > > Other testing: tier1-tier3 tested on Unmatched board. > > I have also tested other possible values for InlineSmallCode like 1500 and 2500, but the numbers say show 2000 would outperform those values in most of the cases. And I can verify no regressions across at least following benchmarks: > > - Dacapo > - SPECjvm2008 > - SPECjbb2005 > - SPECjbb2015 This pull request has now been integrated. 
Changeset: 5e196b4b Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/5e196b4b8e623107424e2fb54672790fd925fe73 Stats: 7 lines in 1 file changed: 0 ins; 6 del; 1 mod 8297476: Increase InlineSmallCode default from 1000 to 2500 for RISC-V Reviewed-by: shade, yzhu ------------- PR: https://git.openjdk.org/jdk/pull/11310 From rrich at openjdk.org Thu Nov 24 13:56:06 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 24 Nov 2022 13:56:06 GMT Subject: RFR: 8286302: Port JEP 425 to PPC64 [v10] In-Reply-To: References: Message-ID: > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. > > The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. > > > X86 / AARCH64 PPC64: > > : : : : > : : : : > | | | | > |-----------------| |-----------------| > | | | | > | stack arguments | | stack arguments | > | |<- callers_SP | | > =================== |-----------------| > | | | | > | metadata at bottom | | metadata at top | > | | | |<- callers_SP > |-----------------| =================== > | | | | > | | | | > | | | | > | | | | > | |<- SP | | > =================== |-----------------| > | | > | metadata at top | > | |<- SP > =================== > > > On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. > > * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: > `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` > > * address of stack arguments: > `callers_SP + frame::metadata_words_at_top` > > * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. > > Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. > > The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. 
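To make the size and address formulas above concrete, here is a minimal sketch; the helper names are made up, and `metadata_words_at_top` is passed in explicitly (0 words on x86_64/aarch64, 4 words on PPC64 per the description) so the snippet stands alone rather than depending on the real frame class:

```
#include <cstddef>
#include <cstdint>

// Illustrative only: word count needed to freeze one frame plus its stack
// arguments into a StackChunk, and the address where the stack arguments start.
size_t words_to_freeze_one_frame(size_t frame_words,
                                 size_t stack_arg_words,
                                 size_t metadata_words_at_top) {
  return frame_words + stack_arg_words + metadata_words_at_top;
}

intptr_t* stack_args_address(intptr_t* callers_sp, size_t metadata_words_at_top) {
  // Word-sized pointer arithmetic: skip the metadata stored at the frame top.
  return callers_sp + metadata_words_at_top;
}
```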
Note that -XX:+VerifyContinuations is explicitly set as I found it very useful; it increases the runtime quite a bit though. > > Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. > > Thanks, Richard. Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 - More Feedback Leonid - Feedback Leonid - Cleanup BasicExp test - Feedback Martin - Cleanup BasicExp.java - Feedback from backwaterred - Fix cpp condition and add PPC64 - Changes lost in merge - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 - ... and 2 more: https://git.openjdk.org/jdk/compare/9c77e41b...0b7e325b ------------- Changes: https://git.openjdk.org/jdk/pull/10961/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=09 Stats: 3564 lines in 66 files changed: 3156 ins; 109 del; 299 mod Patch: https://git.openjdk.org/jdk/pull/10961.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961 PR: https://git.openjdk.org/jdk/pull/10961 From xlinzheng at openjdk.org Thu Nov 24 13:56:21 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Thu, 24 Nov 2022 13:56:21 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: <3s5XMZcuP0SwIRcnpM94VMX4GEDU1KRhz-LgMRCp6kk=.0fb5dc97-57e4-48b4-a2d0-cc0ed6de48bb@github.com> On Wed, 16 Nov 2022 15:03:07 GMT, Roman Kennke wrote: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Hi Roman, This is a very nice refactoring, but we find this patch can interestingly break the RISC-V native fastdebug/slowdebug building process. The root cause is that `measure_code_size()` is always `0`, because `PhaseOutput::init_buffer()` runs too early to collect all CodeStubs emitted in the future. This issue appears to exist before this patch. What is different is, originally `C2EntryBarrierStubTable::estimate_stub_size()` seems to return a non-zero value at all times for now (nmethod entry barriers for continuations). However, due to the pre-existing issue mentioned above, our new code using a sum to calculate the total size will always return a `0`, shrinking the `measure_code_size()` from a non-zero value to `0`. So the `stub_req` is smaller than before. Some code generated in RISC-V would just exceed the smaller threshold when emitting trampoline stubs, causing some failures. The below print can show the zero value when `-XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode` are on.
diff --git a/src/hotspot/share/opto/output.cpp b/src/hotspot/share/opto/output.cpp index f189a12316b..1416c196faf 100644 --- a/src/hotspot/share/opto/output.cpp +++ b/src/hotspot/share/opto/output.cpp @@ -1244,7 +1244,11 @@ CodeBuffer* PhaseOutput::init_buffer() { BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2(); stub_req += bs->estimate_stub_size(); - stub_req += _stub_list.measure_code_size(); + int newlist_size = _stub_list.measure_code_size(); + stub_req += newlist_size; + if (UseNewCode) { + tty->print_cr("the measured size is: %d", newlist_size); + } // nmethod and CodeBuffer count stubs & constants as part of method's code. // class HandlerImpl is platform-specific and defined in the *.ad files. one hs_err file here. (the line number may not match, for I added some debugging stuff in my codebase) [hs_err_pid1282.log](https://github.com/openjdk/jdk/files/10085047/hs_err_pid1282.log) cc @RealFYang Best Regards, Xiaolin ------------- PR: https://git.openjdk.org/jdk/pull/11188 From roland at openjdk.org Thu Nov 24 15:28:23 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 24 Nov 2022 15:28:23 GMT Subject: RFR: 8297556: Parse::check_interpreter_type fails with assert "must constrain OSR typestate" Message-ID: With 6312651 (Compiler should only use verified interface types for optimization), I changed when the _klass field for an array in the type system would be null. Before, there was an assumption that any type (other than top, bottom) could be represented with a non null _klass field. With 6312651, there are types that are impossible to represent with a single klass pointer so now, the _klass field is only guaranteed non null for an array of basic type. When I made that change, I went over uses of the _klass field for anything other than an array of basic type and fixed the code so it uses something other than the _klass field. This is a place I missed. (I made some of the changes for the _klass field after Vladimir I ran extensive testing for the patch so that could explain why this issue was missed) ------------- Commit messages: - test - fix Changes: https://git.openjdk.org/jdk/pull/11356/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11356&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297556 Stats: 50 lines in 2 files changed: 46 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11356.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11356/head:pull/11356 PR: https://git.openjdk.org/jdk/pull/11356 From roland at openjdk.org Thu Nov 24 15:35:29 2022 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 24 Nov 2022 15:35:29 GMT Subject: RFR: 8297343: TestStress*.java fail with "got different traces for the same seed" Message-ID: Root cause from Roberto's analysis: "The regression seems to be due to the introduction of non-determinism in the node dumps of otherwise identical compilations." "The problem seems to be that JDK-6312651 dumps interface sets in an order that is determined by the raw pointers of the set elements. This is unstable across different runs and leads to different node dumps for otherwise identical compilations." "Stable node dumps are useful for debugging (e.g. when diffing compiler traces from two different runs), so the solution is probably dumping interface sets in some order (e.g. lexicographic order of each interface dump) that does not depend on raw pointer values." This patch implements Roberto's recommendation by sorting interfaces on their ciBaseObject::_ident. 
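A standalone sketch of that stabilization idea follows; the real change lives in C2's type.cpp and works on ciKlass/GrowableArray, so the types below are stand-ins used only to show the ordering criterion:

```
#include <algorithm>
#include <cstdio>
#include <vector>

// Order the interface set by a stable, compilation-local id (ciBaseObject::_ident
// in HotSpot) instead of by raw pointer value before printing, so two runs of the
// same compilation produce identical dumps.
struct IfaceStandIn {
  int ident;          // stable per-compilation id
  const char* name;
};

void dump_interfaces(std::vector<const IfaceStandIn*> ifaces) {
  std::sort(ifaces.begin(), ifaces.end(),
            [](const IfaceStandIn* a, const IfaceStandIn* b) { return a->ident < b->ident; });
  for (const IfaceStandIn* k : ifaces) {
    std::printf("%s ", k->name);
  }
  std::printf("\n");
}
```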
------------- Commit messages: - comment - fix Changes: https://git.openjdk.org/jdk/pull/11357/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11357&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297343 Stats: 11 lines in 1 file changed: 9 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11357.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11357/head:pull/11357 PR: https://git.openjdk.org/jdk/pull/11357 From rkennke at openjdk.org Thu Nov 24 15:40:23 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 24 Nov 2022 15:40:23 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 15:03:07 GMT, Roman Kennke wrote: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Hi Xiaolin, > This is a very nice refactoring, but we find this patch can interestingly break the RISC-V native fastdebug/slowdebug building process. > > The root cause is the `measure_code_size()` is always a `0` value, for `PhaseOutput::init_buffer()` is too early to collect all CodeStubs emitted in the future. This issue appears to exist before this patch. > > What is different is, originally `C2EntryBarrierStubTable::estimate_stub_size()` seems to return a non-zero value at all times for now (nmethod entry barriers for continuations). However, due to the pre-existing issue mentioned above, our new code using a sum to calculate the total size will always return a `0`, shrinking the `measure_code_size()` from a non-zero value to `0`. So the `stub_req` is smaller than before. Some code generated in RISC-V would just exceed the smaller threshold when emitting trampolines stubs, causing some failures. > > The below print can show the zero value when `-XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode` are on. > > ``` > diff --git a/src/hotspot/share/opto/output.cpp b/src/hotspot/share/opto/output.cpp > index f189a12316b..1416c196faf 100644 > --- a/src/hotspot/share/opto/output.cpp > +++ b/src/hotspot/share/opto/output.cpp > @@ -1244,7 +1244,11 @@ CodeBuffer* PhaseOutput::init_buffer() { > > BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2(); > stub_req += bs->estimate_stub_size(); > - stub_req += _stub_list.measure_code_size(); > + int newlist_size = _stub_list.measure_code_size(); > + stub_req += newlist_size; > + if (UseNewCode) { > + tty->print_cr("the measured size is: %d", newlist_size); > + } > > // nmethod and CodeBuffer count stubs & constants as part of method's code. > // class HandlerImpl is platform-specific and defined in the *.ad files. > ``` > > one hs_err file here. (the line number may not match, for I added some debugging stuff in my codebase) [hs_err_pid1282.log](https://github.com/openjdk/jdk/files/10085047/hs_err_pid1282.log) I don't quite understand. Are you saying that in init_buffer(), we don't have any method_entry_barrier stubs, yet, and therefore we return 0 there? 
In this case I don't see how it could have worked before. Or are you saying that we have one method entry barrier stub, but measuring its size returns 0? In this case I don't understand why. The measurement creates a scratch buffer, and emits one stub to it, and measures the size of the generated code. C2MethodEntryBarrier::emit() doesn't have any relevant conditional in it, it should always emit something, and the size should be > 0. Or what am I missing? Thank you, Roman ------------- PR: https://git.openjdk.org/jdk/pull/11188 From thartmann at openjdk.org Thu Nov 24 15:45:24 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 24 Nov 2022 15:45:24 GMT Subject: RFR: 8297556: Parse::check_interpreter_type fails with assert "must constrain OSR typestate" In-Reply-To: References: Message-ID: <-VraobYHAJoKs86iHqVsojv7vShcRHJ_oSUoRNM4H3U=.97bad425-2ef3-4f02-8297-6e87a79a98ea@github.com> On Thu, 24 Nov 2022 15:20:27 GMT, Roland Westrelin wrote: > With 6312651 (Compiler should only use verified interface types for > optimization), I changed when the _klass field for an array in the > type system would be null. Before, there was an assumption that any > type (other than top, bottom) could be represented with a non null > _klass field. With 6312651, there are types that are impossible to > represent with a single klass pointer so now, the _klass field is only > guaranteed non null for an array of basic type. When I made that > change, I went over uses of the _klass field for anything other than > an array of basic type and fixed the code so it uses something other > than the _klass field. This is a place I missed. > > (I made some of the changes for the _klass field after Vladimir I ran > extensive testing for the patch so that could explain why this issue > was missed) Looks good to me. All tests passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11356 From thartmann at openjdk.org Thu Nov 24 15:50:35 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 24 Nov 2022 15:50:35 GMT Subject: RFR: 8297343: TestStress*.java fail with "got different traces for the same seed" In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 15:27:41 GMT, Roland Westrelin wrote: > Root cause from Roberto's analysis: > > "The regression seems to be due to the introduction of non-determinism > in the node dumps of otherwise identical compilations." > > "The problem seems to be that JDK-6312651 dumps interface sets in an > order that is determined by the raw pointers of the set elements. This > is unstable across different runs and leads to different node dumps > for otherwise identical compilations." > > "Stable node dumps are useful for debugging (e.g. when diffing > compiler traces from two different runs), so the solution is probably > dumping interface sets in some order (e.g. lexicographic order of each > interface dump) that does not depend on raw pointer values." > > This patch implements Roberto's recommendation by sorting interfaces > on their ciBaseObject::_ident. You also need to remove the tests from the problem list. Looks good to me otherwise. 
src/hotspot/share/opto/type.cpp line 3187: > 3185: static int compare_interfaces(ciKlass** k1, ciKlass** k2) { > 3186: return (int)((*k1)->ident() - (*k2)->ident()); > 3187: } Suggestion: static int compare_interfaces(ciKlass** k1, ciKlass** k2) { return (int)((*k1)->ident() - (*k2)->ident()); } src/hotspot/share/opto/type.cpp line 3197: > 3195: GrowableArray interfaces; > 3196: interfaces.appendAll(&_list); > 3197: // Sort the interfaces so there's listed in the same order from one run to the other of the same compilation Suggestion: // Sort the interfaces so they are listed in the same order from one run to the other of the same compilation ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11357 From bkilambi at openjdk.org Thu Nov 24 15:56:08 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 24 Nov 2022 15:56:08 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v5] In-Reply-To: References: Message-ID: > Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - > > eor a, a, b > eor a, a, c > > can be optimized to single instruction - `eor3 a, b, c` > > This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - > > > Benchmark gain > TestEor3.test1Int 10.87% > TestEor3.test1Long 8.84% > TestEor3.test2Int 21.68% > TestEor3.test2Long 21.04% > > > The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Resolve merge conflicts with master - Merge branch 'master' into JDK-8293488 - Removed svesha3 feature check for eor3 - Changed the modifier order preference in JTREG test - Modified JTREG test to include feature constraints - 8293488: Add EOR3 backend rule for aarch64 SHA3 extension Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - eor a, a, b eor a, a, c can be optimized to single instruction - eor3 a, b, c This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. 
Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - Benchmark gain TestEor3.test1Int 10.87% TestEor3.test1Long 8.84% TestEor3.test2Int 21.68% TestEor3.test2Long 21.04% The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. ------------- Changes: https://git.openjdk.org/jdk/pull/10407/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10407&range=04 Stats: 330 lines in 7 files changed: 295 ins; 0 del; 35 mod Patch: https://git.openjdk.org/jdk/pull/10407.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10407/head:pull/10407 PR: https://git.openjdk.org/jdk/pull/10407 From sgehwolf at openjdk.org Thu Nov 24 16:40:25 2022 From: sgehwolf at openjdk.org (Severin Gehwolf) Date: Thu, 24 Nov 2022 16:40:25 GMT Subject: RFR: 8297590: [TESTBUG] HotSpotResolvedJavaFieldTest does not run Message-ID: Simple test fix. Test now also runs when running the JVMCI test set. Testing: jvmci tests. Includes the test now and passes. ------------- Commit messages: - 8297590: [TESTBUG] HotSpotResolvedJavaFieldTest does not run Changes: https://git.openjdk.org/jdk/pull/11358/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11358&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297590 Stats: 10 lines in 1 file changed: 8 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11358.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11358/head:pull/11358 PR: https://git.openjdk.org/jdk/pull/11358 From mdoerr at openjdk.org Thu Nov 24 17:00:42 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 24 Nov 2022 17:00:42 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v12] In-Reply-To: References: Message-ID: > This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Add regression test. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10933/files - new: https://git.openjdk.org/jdk/pull/10933/files/90617b36..2c5d2839 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10933&range=10-11 Stats: 84 lines in 1 file changed: 84 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10933.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10933/head:pull/10933 PR: https://git.openjdk.org/jdk/pull/10933 From sgehwolf at openjdk.org Thu Nov 24 17:09:29 2022 From: sgehwolf at openjdk.org (Severin Gehwolf) Date: Thu, 24 Nov 2022 17:09:29 GMT Subject: RFR: 8297590: [TESTBUG] HotSpotResolvedJavaFieldTest does not run In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 16:33:00 GMT, Severin Gehwolf wrote: > Simple test fix. Test now also runs when running the JVMCI test set. > > Testing: jvmci tests. Includes the test now and passes. 
test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/HotSpotResolvedJavaFieldTest.java line 72: > 70: try { > 71: Class typeImpl = Class.forName("jdk.vm.ci.hotspot.HotSpotResolvedObjectTypeImpl"); > 72: m = typeImpl.getDeclaredMethod("createField", JavaType.class, int.class, int.class, int.class); Note this is needed as the mentioned method in class `HotSpotResolvedObjectTypeImpl` has `int` signature for the second param. Not `long`. ------------- PR: https://git.openjdk.org/jdk/pull/11358 From shade at redhat.com Thu Nov 24 17:25:30 2022 From: shade at redhat.com (Aleksey Shipilev) Date: Thu, 24 Nov 2022 18:25:30 +0100 Subject: C2, ThreadLocalNode, and Loom Message-ID: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com> Hi, In Loom x86_32 port, I am following up on the remaining C2 bugs. I believe one bug can be summarized as follows. C2 models thread-locals with ThreadLocalNode (TLN). TLN is effectively a constant node: its only input is root, and it hashes like a normal node. This was logically sound for decades, because the code never switched the threads. Therefore, code is free to treat thread address as constant. In Loom, this guarantee no longer holds. If we ever store the "old" value of TLN node somewhere and reuse it past the potential yield-resume point, we end up using the *wrong* thread. How is this not a problem we saw before? On most architectures, we have the dedicated thread register, and TLN matches to it. That dedicated thread register holds the true current thread already. AFAICS, most TLN uses go straight to various AddP-s, so we are reasonably safe that no naked TLS addresses are stored, and the majority (all?) uses reference that thread register. (I am not sure what protects us from accidentally "caching" thread register into adhoc one. It would make little sense from performance/compiler standpoint, but I cannot yet see what theoretically prevents it in C2 code.) On x86_32, however, TLN matches to full MacroAssembler::get_thread call, and there storing the thread address into an adhoc register is a normal thing to do. Reusing that register over the continuation switch points visibly breaks x86_32. This usually manifests like a heap corruption because multiple threads stomp over foreign TLABs, or a failure in runtime GC code. Current failures in Loom x86_32 port, for example, can be easily reproduced by adding a simple assert in any G1 runtime method that pulls the (wrong) thread (mis)loaded from C2 barrier: // G1 pre write barrier slowpath JRT_LEAF(void, G1BarrierSetRuntime::write_ref_field_pre_entry(oopDesc* orig, JavaThread* thread)) assert(thread == JavaThread::current(), "write_ref_field_pre_entry sanity"); So, while this manifests on x86_32, I think this is symptom of a larger problem with assuming TLN const-ness. At this point, I am trying to come up with solutions: 1) Dodge the problem in x86_32 and then pretend all arches have dedicated thread registers Sacrifice one x86_32 register for thread address. This would likely to penalize performance a little bit, because x86_32 does not have lots of general purpose registers to begin with. We can probably try and go to FS/GS instead of carrying the address in the register; but I don't know how much work would that entail. This feels like a cowardly way out, and it would still break any future arch that does not have dedicated thread registers. And, it would break if we ever replace and ThreadLocalNode with the call to Thread::current(). 
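The shape of the hazard can be sketched in plain C++; this is not what C2 emits, and the fake thread switch below merely stands in for a continuation freeze/thaw that resumes on a different carrier thread:

```
#include <cstdio>

// Conceptual sketch only -- nothing here is HotSpot code. It shows the code
// shape that becomes wrong once a yield can switch the carrier thread between
// two uses of a cached "current thread" value: the debugging assert quoted
// above (thread == JavaThread::current()) fires on exactly this pattern.
struct FakeThread { int id; };

static FakeThread thread_a{1}, thread_b{2};
static FakeThread* tls_current = &thread_a;     // stand-in for the real TLS slot

static FakeThread* current_thread() { return tls_current; }
static void maybe_yield_continuation() { tls_current = &thread_b; }  // simulate a switch
static void runtime_barrier(FakeThread* t) {
  std::printf("passed thread %d, current thread is %d\n", t->id, current_thread()->id);
}

int main() {
  FakeThread* t = current_thread();   // observed once, "constant" in the old model
  maybe_yield_continuation();         // potential freeze/thaw; carrier may change
  runtime_barrier(t);                 // uses the stale thread -> 1 vs 2
  return 0;
}
```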
2) Remodel ThreadLocalNode as non-constant What partially solves the problem: saying that ThreadLocalNode::hash() is NO_HASH. AFAICS, this successfully prevents collapsing TLNs in the same compilation. This still does not solve the case where a single TLN gets yanked to the earliest block and its value cached in the register. AFAIU, we only want to make sure that TLN is reloaded after the potential continuation yield, which also serves as the point of return. Since continuation yields are modeled as calls, and calls produce both control and memory, we might need to hook up TLN to either control or memory. I tried to hook up the current control to every TLN node [1]. It works with a few wrinkles, but the patch shows there are ripple effects throughout C2 code, and it sometimes breaks the graph. Some pattern matching code (for example AddP matching code in EA) also asserts, probably assuming that TLNs have no inputs. I suspect other places might have implicit dependencies like these as well. This would be the inevitable consequence for any patch that changes ThreadLocalNode inputs/outputs. 3) Some other easy way out I am overlooking? Any other ideas? -- Thanks, -Aleksey [1] https://cr.openjdk.java.net/~shade/loom/x86_32/tln-ctrl-1.patch From rkennke at amazon.de Thu Nov 24 17:35:13 2022 From: rkennke at amazon.de (Kennke, Roman) Date: Thu, 24 Nov 2022 18:35:13 +0100 Subject: C2, ThreadLocalNode, and Loom In-Reply-To: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com> References: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com> Message-ID: <85a39bd8-3cae-b8d0-0694-77c96821f475@amazon.de> > 2) Remodel ThreadLocalNode as non-constant > > What partially solves the problem: saying that ThreadLocalNode::hash() > is NO_HASH. AFAICS, this > successfully prevents collapsing TLNs in the same compilation. This > still does not solve the case > where a single TLN gets yanked to the earliest block and its value > cached in the register. > > AFAIU, we only want to make sure that TLN is reloaded after the > potential continuation yield, which > also serves as the point of return. Since continuation yields are > modeled as calls, and calls > produce both control and memory, we might need to hook up TLN to either > control or memory. > > I tried to hook up the current control to every TLN node [1]. It works > with a few wrinkles, but the > patch shows there are ripple effects throughout C2 code, and it > sometimes breaks the graph. Some > pattern matching code (for example AddP matching code in EA) also > asserts, probably assuming that > TLNs have no inputs. I suspect other places might have implicit > dependencies like these as well. > This would be the inevitable consequence for any patch that changes > ThreadLocalNode inputs/outputs. This reminds me of a similar problem that doesn't have a good solution yet, and solving one might be a template for the other. In GC barriers, one thing that is very commonly done is 1. Load GC state from a global field, 2. Branch to slow- (or mid-) path if GC is supposed to take some action. This GC state is loaded at every barrier site, because it's treated as RAW memory. However, since it only ever changes at safepoints, it would be nice to be able to tell C2 that it doesn't need reloading between safepoints. I suspect that if we had a constant-between-safepoints notion in C2, several other uses and possible optimizations could be found. Roman Amazon Development Center Germany GmbH Krausenstr. 
38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879 From vlivanov at openjdk.org Thu Nov 24 20:58:04 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 24 Nov 2022 20:58:04 GMT Subject: RFR: 8297556: Parse::check_interpreter_type fails with assert "must constrain OSR typestate" In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 15:20:27 GMT, Roland Westrelin wrote: > With 6312651 (Compiler should only use verified interface types for > optimization), I changed when the _klass field for an array in the > type system would be null. Before, there was an assumption that any > type (other than top, bottom) could be represented with a non null > _klass field. With 6312651, there are types that are impossible to > represent with a single klass pointer so now, the _klass field is only > guaranteed non null for an array of basic type. When I made that > change, I went over uses of the _klass field for anything other than > an array of basic type and fixed the code so it uses something other > than the _klass field. This is a place I missed. > > (I made some of the changes for the _klass field after Vladimir I ran > extensive testing for the patch so that could explain why this issue > was missed) Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/11356 From fyang at openjdk.org Fri Nov 25 03:18:16 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 25 Nov 2022 03:18:16 GMT Subject: RFR: 8297359: RISC-V: improve performance of floating Max Min intrinsics In-Reply-To: References: Message-ID: <_LcFVvFBhmg42IvhZ8DaceKPaqgzoiCcVeZCN9j620o=.31f61588-8573-4f9b-a7a2-1e8cd3979cb7@github.com> On Wed, 23 Nov 2022 15:20:47 GMT, Vladimir Kempik wrote: > Please review this change. > > It improves performance of Math.min/max intrinsics for Floats and Doubles. > > The main issue in these intrinsics is the requirement to return NaN if any of arguments is NaN. In risc-v, fmin/fmax returns NaN only if both of src registers are NaN ( quiet NaN). > That requires additional logic to handle the case where only of of src is NaN. > > Here the postcheck with flt (floating less than comparision) and flags analysis replaced with precheck. The precheck is done with 2 fclass on both src then checking combined ( by or-in) result, if one of src is NaN then put the NaN into dst ( using fadd dst, src1, src2). > > Microbench results: > > The results on the thead c910: > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 53752.831 ? 97.198 ns/op > FpMinMaxIntrinsics.dMin avgt 25 53707.229 ? 177.559 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 42805.985 ? 9.901 ns/op > FpMinMaxIntrinsics.fMax avgt 25 53449.568 ? 215.294 ns/op > FpMinMaxIntrinsics.fMin avgt 25 53504.106 ? 180.833 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 42794.579 ? 7.013 ns/op > MaxMinOptimizeTest.dAdd avgt 25 381.138 ? 5.692 us/op > MaxMinOptimizeTest.dMax avgt 25 4575.094 ? 17.065 us/op > MaxMinOptimizeTest.dMin avgt 25 4584.648 ? 18.561 us/op > MaxMinOptimizeTest.dMul avgt 25 384.615 ? 7.751 us/op > MaxMinOptimizeTest.fAdd avgt 25 318.076 ? 3.308 us/op > MaxMinOptimizeTest.fMax avgt 25 4405.724 ? 20.353 us/op > MaxMinOptimizeTest.fMin avgt 25 4421.652 ? 18.029 us/op > MaxMinOptimizeTest.fMul avgt 25 305.462 ? 19.437 us/op > > after > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 10712.246 ? 
5.607 ns/op > FpMinMaxIntrinsics.dMin avgt 25 10732.655 ? 41.894 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 3248.106 ? 2.143 ns/op > FpMinMaxIntrinsics.fMax avgt 25 10707.084 ? 3.276 ns/op > FpMinMaxIntrinsics.fMin avgt 25 10719.771 ? 14.864 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 3274.775 ? 0.996 ns/op > MaxMinOptimizeTest.dAdd avgt 25 383.720 ? 8.849 us/op > MaxMinOptimizeTest.dMax avgt 25 429.345 ? 11.160 us/op > MaxMinOptimizeTest.dMin avgt 25 439.980 ? 3.757 us/op > MaxMinOptimizeTest.dMul avgt 25 390.126 ? 10.258 us/op > MaxMinOptimizeTest.fAdd avgt 25 300.005 ? 18.206 us/op > MaxMinOptimizeTest.fMax avgt 25 370.467 ? 6.054 us/op > MaxMinOptimizeTest.fMin avgt 25 375.134 ? 4.568 us/op > MaxMinOptimizeTest.fMul avgt 25 305.344 ? 18.307 us/op > > hifive umatched > > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 30234.224 ? 16.744 ns/op > FpMinMaxIntrinsics.dMin avgt 25 30227.686 ? 15.389 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 15766.749 ? 3.724 ns/op > FpMinMaxIntrinsics.fMax avgt 25 30140.092 ? 10.243 ns/op > FpMinMaxIntrinsics.fMin avgt 25 30149.470 ? 34.041 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 15760.770 ? 5.415 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1155.234 ? 4.603 us/op > MaxMinOptimizeTest.dMax avgt 25 2597.897 ? 3.307 us/op > MaxMinOptimizeTest.dMin avgt 25 2599.183 ? 3.806 us/op > MaxMinOptimizeTest.dMul avgt 25 1155.281 ? 1.813 us/op > MaxMinOptimizeTest.fAdd avgt 25 750.967 ? 7.254 us/op > MaxMinOptimizeTest.fMax avgt 25 2305.085 ? 1.556 us/op > MaxMinOptimizeTest.fMin avgt 25 2305.306 ? 1.478 us/op > MaxMinOptimizeTest.fMul avgt 25 750.623 ? 7.357 us/op > > 2fclass_new > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 23599.547 ? 29.571 ns/op > FpMinMaxIntrinsics.dMin avgt 25 23593.236 ? 18.456 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 8630.201 ? 1.353 ns/op > FpMinMaxIntrinsics.fMax avgt 25 23496.337 ? 18.340 ns/op > FpMinMaxIntrinsics.fMin avgt 25 23477.881 ? 8.545 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 8629.135 ? 0.869 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1155.479 ? 4.938 us/op > MaxMinOptimizeTest.dMax avgt 25 1560.323 ? 3.077 us/op > MaxMinOptimizeTest.dMin avgt 25 1558.668 ? 2.421 us/op > MaxMinOptimizeTest.dMul avgt 25 1154.919 ? 2.077 us/op > MaxMinOptimizeTest.fAdd avgt 25 751.325 ? 7.169 us/op > MaxMinOptimizeTest.fMax avgt 25 1306.131 ? 1.102 us/op > MaxMinOptimizeTest.fMin avgt 25 1306.134 ? 0.957 us/op > MaxMinOptimizeTest.fMul avgt 25 750.968 ? 7.334 us/op The performance improvements on both RISC-V hardware platforms look great. I also checked the vector versions of min/max, but looks like they won't help here. So looks good as long as you pass the other necessary tests. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/11327 From xlinzheng at openjdk.org Fri Nov 25 04:42:20 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Fri, 25 Nov 2022 04:42:20 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 15:03:07 GMT, Roman Kennke wrote: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. 
It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Hi Roman, Sorry for the late response - It is the former. > Are you saying that in init_buffer(), we don't have any method_entry_barrier stubs, yet, and therefore we return 0 there? This one. That `output()->_stub_list._stubs` appears to me to be always zero, for nodes are not emitted yet. I confirmed that before this patch `_safepoint_poll_table` is `0` but `_entry_barrier_table` is a non-zero value, like on aarch64 is `24`. Why it can work before is I think it is within some margins of error. Like `code_req += MAX_inst_size; // ensure per-instruction margin`, RISC-V's generated code is more verbose, so reproduces this. Simply adding a `4`, which is just one instruction size, to the new `stub_req` can make the build pass. But the zero value of `_stub_list._stubs` is not an expectant one, though - I am not quite sure the best way to fix that. The table sizes are in fact both 1 to me? Best, Xiaolin ------------- PR: https://git.openjdk.org/jdk/pull/11188 From kbarrett at openjdk.org Fri Nov 25 07:09:18 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 25 Nov 2022 07:09:18 GMT Subject: RFR: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception [v3] In-Reply-To: References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: On Wed, 23 Nov 2022 22:50:34 GMT, Volodymyr Paprotski wrote: >> From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): >> >> >> __ mov(t0, a0); >> __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) > > Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision: > > add UnlockDiagnosticVMOptions This is causing some noise in our CI. While I'm not that familiar with the code being changed, this looks reasonable to me, and has already been approved by 4 much more competent reviewers. So I'm going to sponsor now to get this in sooner. ------------- PR: https://git.openjdk.org/jdk/pull/11308 From duke at openjdk.org Fri Nov 25 07:10:52 2022 From: duke at openjdk.org (Volodymyr Paprotski) Date: Fri, 25 Nov 2022 07:10:52 GMT Subject: Integrated: 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception In-Reply-To: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> References: <6EUTYCRx6TH5HSleuN8KTk2RliVRyK0QljfCj3c0meA=.5227e497-0675-4cd1-ae1e-83d136816753@github.com> Message-ID: <975X59jwDM2ErrOoYEmUS_aKifi7oPLM9Jf8y041yMU=.d3d0f939-92d1-4c88-b168-ee080ed256cd@github.com> On Wed, 23 Nov 2022 02:59:30 GMT, Volodymyr Paprotski wrote: > From https://github.com/openjdk/jdk/pull/10582, `t0` gets clobbered if `rscratch` is used. Example, [here](https://github.com/openjdk/jdk/blob/09f70dad2fe3f0691afacded6c38f61fa8a0d28d/src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp#L605-L606): > > > __ mov(t0, a0); > __ andq(t0, ExternalAddress(poly1305_mask44()), rscratch); // First limb (R^4[43:0]) This pull request has now been integrated. 
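Coming back to the stub-size estimate discussed in the 8297036 thread above: one conservative way to cover stubs that are not yet registered at `init_buffer()` time is a fixed per-stub margin, roughly along the lines of the sketch below (illustrative only; the names are invented here, and this is not necessarily how the issue was or should be fixed):

```
// Illustrative sketch: pad the measured stub-section size with one instruction
// of slack per stub that may still be added later in the output phase.
// Placeholder names; not JDK code.
int estimated_stub_section_bytes(int measured_bytes,
                                 int max_pending_stubs,
                                 int instruction_bytes /* e.g. 4 on RISC-V */) {
  return measured_bytes + max_pending_stubs * instruction_bytes;
}
```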
Changeset: 74d3bacc Author: Volodymyr Paprotski Committer: Kim Barrett URL: https://git.openjdk.org/jdk/commit/74d3baccb332c07f4ce58a53d7e9d36d3d4b8318 Stats: 24 lines in 2 files changed: 0 ins; 2 del; 22 mod 8297417: Poly1305IntrinsicFuzzTest fails with tag mismatch exception Reviewed-by: sviswanathan, vlivanov, rcastanedalo, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11308 From vkempik at openjdk.org Fri Nov 25 08:03:59 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Fri, 25 Nov 2022 08:03:59 GMT Subject: RFR: 8297359: RISC-V: improve performance of floating Max Min intrinsics In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 15:20:47 GMT, Vladimir Kempik wrote: > Please review this change. > > It improves performance of Math.min/max intrinsics for Floats and Doubles. > > The main issue in these intrinsics is the requirement to return NaN if any of arguments is NaN. In risc-v, fmin/fmax returns NaN only if both of src registers are NaN ( quiet NaN). > That requires additional logic to handle the case where only of of src is NaN. > > Here the postcheck with flt (floating less than comparision) and flags analysis replaced with precheck. The precheck is done with 2 fclass on both src then checking combined ( by or-in) result, if one of src is NaN then put the NaN into dst ( using fadd dst, src1, src2). > > Microbench results: > > The results on the thead c910: > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 53752.831 ? 97.198 ns/op > FpMinMaxIntrinsics.dMin avgt 25 53707.229 ? 177.559 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 42805.985 ? 9.901 ns/op > FpMinMaxIntrinsics.fMax avgt 25 53449.568 ? 215.294 ns/op > FpMinMaxIntrinsics.fMin avgt 25 53504.106 ? 180.833 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 42794.579 ? 7.013 ns/op > MaxMinOptimizeTest.dAdd avgt 25 381.138 ? 5.692 us/op > MaxMinOptimizeTest.dMax avgt 25 4575.094 ? 17.065 us/op > MaxMinOptimizeTest.dMin avgt 25 4584.648 ? 18.561 us/op > MaxMinOptimizeTest.dMul avgt 25 384.615 ? 7.751 us/op > MaxMinOptimizeTest.fAdd avgt 25 318.076 ? 3.308 us/op > MaxMinOptimizeTest.fMax avgt 25 4405.724 ? 20.353 us/op > MaxMinOptimizeTest.fMin avgt 25 4421.652 ? 18.029 us/op > MaxMinOptimizeTest.fMul avgt 25 305.462 ? 19.437 us/op > > after > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 10712.246 ? 5.607 ns/op > FpMinMaxIntrinsics.dMin avgt 25 10732.655 ? 41.894 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 3248.106 ? 2.143 ns/op > FpMinMaxIntrinsics.fMax avgt 25 10707.084 ? 3.276 ns/op > FpMinMaxIntrinsics.fMin avgt 25 10719.771 ? 14.864 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 3274.775 ? 0.996 ns/op > MaxMinOptimizeTest.dAdd avgt 25 383.720 ? 8.849 us/op > MaxMinOptimizeTest.dMax avgt 25 429.345 ? 11.160 us/op > MaxMinOptimizeTest.dMin avgt 25 439.980 ? 3.757 us/op > MaxMinOptimizeTest.dMul avgt 25 390.126 ? 10.258 us/op > MaxMinOptimizeTest.fAdd avgt 25 300.005 ? 18.206 us/op > MaxMinOptimizeTest.fMax avgt 25 370.467 ? 6.054 us/op > MaxMinOptimizeTest.fMin avgt 25 375.134 ? 4.568 us/op > MaxMinOptimizeTest.fMul avgt 25 305.344 ? 18.307 us/op > > hifive umatched > > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 30234.224 ? 16.744 ns/op > FpMinMaxIntrinsics.dMin avgt 25 30227.686 ? 15.389 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 15766.749 ? 3.724 ns/op > FpMinMaxIntrinsics.fMax avgt 25 30140.092 ? 10.243 ns/op > FpMinMaxIntrinsics.fMin avgt 25 30149.470 ? 
34.041 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 15760.770 ? 5.415 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1155.234 ? 4.603 us/op > MaxMinOptimizeTest.dMax avgt 25 2597.897 ? 3.307 us/op > MaxMinOptimizeTest.dMin avgt 25 2599.183 ? 3.806 us/op > MaxMinOptimizeTest.dMul avgt 25 1155.281 ? 1.813 us/op > MaxMinOptimizeTest.fAdd avgt 25 750.967 ? 7.254 us/op > MaxMinOptimizeTest.fMax avgt 25 2305.085 ? 1.556 us/op > MaxMinOptimizeTest.fMin avgt 25 2305.306 ? 1.478 us/op > MaxMinOptimizeTest.fMul avgt 25 750.623 ? 7.357 us/op > > 2fclass_new > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 23599.547 ? 29.571 ns/op > FpMinMaxIntrinsics.dMin avgt 25 23593.236 ? 18.456 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 8630.201 ? 1.353 ns/op > FpMinMaxIntrinsics.fMax avgt 25 23496.337 ? 18.340 ns/op > FpMinMaxIntrinsics.fMin avgt 25 23477.881 ? 8.545 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 8629.135 ? 0.869 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1155.479 ? 4.938 us/op > MaxMinOptimizeTest.dMax avgt 25 1560.323 ? 3.077 us/op > MaxMinOptimizeTest.dMin avgt 25 1558.668 ? 2.421 us/op > MaxMinOptimizeTest.dMul avgt 25 1154.919 ? 2.077 us/op > MaxMinOptimizeTest.fAdd avgt 25 751.325 ? 7.169 us/op > MaxMinOptimizeTest.fMax avgt 25 1306.131 ? 1.102 us/op > MaxMinOptimizeTest.fMin avgt 25 1306.134 ? 0.957 us/op > MaxMinOptimizeTest.fMul avgt 25 750.968 ? 7.334 us/op Thanks for looking at it. Tier1 is fine, running the rest ------------- PR: https://git.openjdk.org/jdk/pull/11327 From roland at openjdk.org Fri Nov 25 08:06:28 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 25 Nov 2022 08:06:28 GMT Subject: RFR: 8297556: Parse::check_interpreter_type fails with assert "must constrain OSR typestate" In-Reply-To: <-VraobYHAJoKs86iHqVsojv7vShcRHJ_oSUoRNM4H3U=.97bad425-2ef3-4f02-8297-6e87a79a98ea@github.com> References: <-VraobYHAJoKs86iHqVsojv7vShcRHJ_oSUoRNM4H3U=.97bad425-2ef3-4f02-8297-6e87a79a98ea@github.com> Message-ID: <1Qh72mgY9OtzPr2Zwsua7dQXaiLGB_6jTCSuQT8orQc=.ec18c4e8-4b79-479b-b7a9-35e6c2066304@github.com> On Thu, 24 Nov 2022 15:43:17 GMT, Tobias Hartmann wrote: >> With 6312651 (Compiler should only use verified interface types for >> optimization), I changed when the _klass field for an array in the >> type system would be null. Before, there was an assumption that any >> type (other than top, bottom) could be represented with a non null >> _klass field. With 6312651, there are types that are impossible to >> represent with a single klass pointer so now, the _klass field is only >> guaranteed non null for an array of basic type. When I made that >> change, I went over uses of the _klass field for anything other than >> an array of basic type and fixed the code so it uses something other >> than the _klass field. This is a place I missed. >> >> (I made some of the changes for the _klass field after Vladimir I ran >> extensive testing for the patch so that could explain why this issue >> was missed) > > Looks good to me. All tests passed. 
@TobiHartmann @iwanowww thanks for the reviews ------------- PR: https://git.openjdk.org/jdk/pull/11356 From roland at openjdk.org Fri Nov 25 08:10:25 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 25 Nov 2022 08:10:25 GMT Subject: Integrated: 8297556: Parse::check_interpreter_type fails with assert "must constrain OSR typestate" In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 15:20:27 GMT, Roland Westrelin wrote: > With 6312651 (Compiler should only use verified interface types for > optimization), I changed when the _klass field for an array in the > type system would be null. Before, there was an assumption that any > type (other than top, bottom) could be represented with a non null > _klass field. With 6312651, there are types that are impossible to > represent with a single klass pointer so now, the _klass field is only > guaranteed non null for an array of basic type. When I made that > change, I went over uses of the _klass field for anything other than > an array of basic type and fixed the code so it uses something other > than the _klass field. This is a place I missed. > > (I made some of the changes for the _klass field after Vladimir I ran > extensive testing for the patch so that could explain why this issue > was missed) This pull request has now been integrated. Changeset: cfe5a371 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/cfe5a3716e980734c3d195f7eec8c383337dca2d Stats: 50 lines in 2 files changed: 46 ins; 2 del; 2 mod 8297556: Parse::check_interpreter_type fails with assert "must constrain OSR typestate" Reviewed-by: thartmann, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/11356 From chagedorn at openjdk.org Fri Nov 25 08:17:21 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 25 Nov 2022 08:17:21 GMT Subject: RFR: 8297384: Add IR tests for existing idealizations of arithmetic nodes [v3] In-Reply-To: <8xK2P815j62-TYJQ0VmEe4VOXZkFcjP8JzdIjUf5wb0=.a7e79ed6-3460-4d8b-ae0e-a163dcaabe9f@github.com> References: <8xK2P815j62-TYJQ0VmEe4VOXZkFcjP8JzdIjUf5wb0=.a7e79ed6-3460-4d8b-ae0e-a163dcaabe9f@github.com> Message-ID: On Wed, 23 Nov 2022 20:10:47 GMT, Zhiqiang Zang wrote: > > We should make sure to make them pass on x86 as well such that GHA is clean after the integration of this PR. I've had a quick look on the match rules and it seems that we only use `RotateLeft/Right` nodes on x86_64, aarch64 and riscv. We could restrict these tests to only run on these architectures with `@requires os.arch == "x86_64" | os.arch == "aarch64" | os.arch == "riscv64"`. > > Thanks for your help! I have added the restrictions and the GHA should be clean now. I have a question though: where did you see the architectures where RotateLeft/Right nodes are used? so that I can know when the the same case happens again. Thanks. Since `RotateLeft/Right` are not macro nodes or other intermediate nodes like `Opaque*` which go away during optimizations, we need to find a match rule for them in order to build the mach graph. I therefore just searched for `match(.*Rotate` and only found rules in `aarch64.ad`, `x86_64.ad` and `riscv_b.ad` (in `x86.ad`, we only find rules for the vector rotate nodes `Rotate.*V`). 
-------------

PR: https://git.openjdk.org/jdk/pull/11049

From duke at openjdk.org Fri Nov 25 08:21:03 2022
From: duke at openjdk.org (Zhiqiang Zang)
Date: Fri, 25 Nov 2022 08:21:03 GMT
Subject: Integrated: 8297384: Add IR tests for existing idealizations of arithmetic nodes
In-Reply-To: 
References: 
Message-ID: 

On Wed, 9 Nov 2022 00:20:37 GMT, Zhiqiang Zang wrote:

> I noticed some idealizations have no associated IR tests so I included tests for them.

This pull request has now been integrated.

Changeset: fd910f77
Author: Zhiqiang Zang
Committer: Christian Hagedorn
URL: https://git.openjdk.org/jdk/commit/fd910f77bcd205110688b2f17f26f76ce3de88d5
Stats: 596 lines in 11 files changed: 519 ins; 0 del; 77 mod

8297384: Add IR tests for existing idealizations of arithmetic nodes

Reviewed-by: chagedorn

-------------

PR: https://git.openjdk.org/jdk/pull/11049

From rwestrel at redhat.com Fri Nov 25 08:58:45 2022
From: rwestrel at redhat.com (Roland Westrelin)
Date: Fri, 25 Nov 2022 09:58:45 +0100
Subject: C2, ThreadLocalNode, and Loom
In-Reply-To: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com>
References: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com>
Message-ID: <87sfi776m2.fsf@redhat.com>

> naked TLS addresses are stored, and the majority (all?) uses reference that thread register. (I am not sure what protects us from accidentally "caching" thread register into adhoc one. It would make little sense from performance/compiler standpoint, but I cannot yet see what theoretically prevents it in C2 code.)

I don't think the thread register on, say, x64 can be spilled. tsLoadP has its own register mask (single register mask: r15) that doesn't intersect with the mask used by spilling instructions (they don't mess with r15) so it's impossible for tsLoadP to be an input to a spill. tsLoadP is rematerializable so when r15 is killed, rather than spill it, a new tsLoadP is added near uses.

> 3) Some other easy way out I am overlooking?

Ideally, I think we want to avoid putting the burden of an uncommon case on c2 generated code: can we have the runtime code that performs the yield update the register that contains the thread (like gc logic would update the live oops at a safepoint)? That would require a way for runtime code to know where the thread pointer lives which is not straightforward AFAICT.

Roland.

From rwestrel at redhat.com Fri Nov 25 09:03:58 2022
From: rwestrel at redhat.com (Roland Westrelin)
Date: Fri, 25 Nov 2022 10:03:58 +0100
Subject: C2, ThreadLocalNode, and Loom
In-Reply-To: <85a39bd8-3cae-b8d0-0694-77c96821f475@amazon.de>
References: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com> <85a39bd8-3cae-b8d0-0694-77c96821f475@amazon.de>
Message-ID: <87pmdb76dd.fsf@redhat.com>

> This reminds me of a similar problem that doesn't have a good solution yet, and solving one might be a template for the other. In GC barriers, one thing that is very commonly done is 1. Load GC state from a global field, 2. Branch to slow- (or mid-) path if GC is supposed to take some action. This GC state is loaded at every barrier site, because it's treated as RAW memory. However, since it only ever changes at safepoints, it would be nice to be able to tell C2 that it doesn't need reloading between safepoints. I suspect that if we had a constant-between-safepoints notion in C2, several other uses and possible optimizations could be found.

That would require SafePointNode to produce a new memory state. Then there's no need to pin nodes that must not float across a safepoint.
Anyway, that's the kind of change to c2 that's disruptive with little benefit that's very unlikely to happen. Roland. From rkennke at openjdk.org Fri Nov 25 11:07:34 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 25 Nov 2022 11:07:34 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 04:39:51 GMT, Xiaolin Zheng wrote: > Hi Roman, > > Sorry for the late response - It is the former. > > > Are you saying that in init_buffer(), we don't have any method_entry_barrier stubs, yet, and therefore we return 0 there? > > This one. That `output()->_stub_list._stubs` appears to me to be always zero, for nodes are not emitted yet. I confirmed that before this patch `_safepoint_poll_table` is `0` but `_entry_barrier_table` is a non-zero value, like on aarch64 is `24`. Why it can work before is I think it is within some margins of error. Like `code_req += MAX_inst_size; // ensure per-instruction margin`, RISC-V's generated code is more verbose, so reproduces this. Simply adding a `4`, which is just one instruction size, to the new `stub_req` can make the build pass. > > But the zero value of `_stub_list._stubs` is not an expectant one, though - I am not quite sure the best way to fix that. The table sizes are in fact both 1 to me? nmethod-barriers are not expected to be always present. It depends on the GC being active, if GC needs it, a (single) nmethod barrier is inserted. This is currently only the case for ZGC and Shenandoah. For all other GCs, nmethod barriers are not used, and thus no stubs should be emitted. ------------- PR: https://git.openjdk.org/jdk/pull/11188 From bkilambi at openjdk.org Fri Nov 25 11:07:45 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 25 Nov 2022 11:07:45 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v5] In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 15:56:08 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. > > Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. 
> The pull request now contains six commits:
>
> - Resolve merge conflicts with master
> - Merge branch 'master' into JDK-8293488
> - Removed svesha3 feature check for eor3
> - Changed the modifier order preference in JTREG test
> - Modified JTREG test to include feature constraints
> - 8293488: Add EOR3 backend rule for aarch64 SHA3 extension
>
> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" - performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example -
>
> eor a, a, b
> eor a, a, c
>
> can be optimized to a single instruction - eor3 a, b, c
>
> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features -
>
> Benchmark gain
> TestEor3.test1Int 10.87%
> TestEor3.test1Long 8.84%
> TestEor3.test2Int 21.68%
> TestEor3.test2Long 21.04%
>
> The numbers shown are performance gains with using the Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon.

Hello, I have resolved merge conflicts and have uploaded the latest patch here. Please review. I messed up with the re-request review option but I now understand how it works (I thought I could re-request from everyone but realized that the previous selection is removed if I select another reviewer). Apologies if any inconvenience caused.

-------------

PR: https://git.openjdk.org/jdk/pull/10407

From vkempik at openjdk.org Fri Nov 25 11:07:46 2022
From: vkempik at openjdk.org (Vladimir Kempik)
Date: Fri, 25 Nov 2022 11:07:46 GMT
Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation
In-Reply-To: 
References: 
Message-ID: 

On Thu, 24 Nov 2022 05:40:12 GMT, Dingli Zhang wrote:

> The instruction which is matched by `VectorLoadConst` creates indices starting from 0 and incremented by 1. In detail, the instruction populates the destination vector by setting the first element to 0 and monotonically incrementing the value by 1 for each subsequent element.
>
> We can add support of `VectorLoadConst` for RISC-V by `vid.v`. It was implemented by referring to RVV v1.0 [1].
>
> Tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. By adding the `-XX:+PrintAssembly -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameters when executing micro-benchmark `IndexVectorBenchmark` [2], the compilation log is as follows:
>
>     0d2 B7: # out( B12 B8 ) <- in( B11 B6 ) Freq: 1
>     .
>     .
>     .
>     0ec vloadcon V3 # generate iota indices
>
> At the same time, the following assembly code will be generated when running the `intIndexVector` case:
>
>     0x00000040144294ac: .4byte 0x10072d7
>     0x00000040144294b0: .4byte 0x5208a1d7
>
> `0x10072d7/0x5208a1d7` are the machine code for `vsetvli/vid.v`.
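(Illustrative only - roughly the Java-level source of such an iota index vector, using the incubating Vector API; the class name and species choice are made up and this is not the benchmark code itself.)

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorSpecies;

    // Needs --add-modules jdk.incubator.vector to compile and run.
    public class IotaSketch {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

        // Produces [0, 1, 2, ...]: the kind of constant index vector that C2
        // models as a VectorLoadConst node and that vid.v can materialize on RISC-V.
        static int[] indices() {
            return IntVector.zero(SPECIES).addIndex(1).toArray();
        }

        public static void main(String[] args) {
            System.out.println(java.util.Arrays.toString(indices()));
        }
    }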
> When running the `floatIndexVector` case, there will be one more instruction than for `intIndexVector`:
>
>     0x000000401443cc9c: .4byte 0x10072d7
>     0x000000401443cca0: .4byte 0x5208a157
>     0x000000401443cca4: .4byte 0x4a219157
>
> `0x4a219157` is the machine code for `vfcvt.f.x.v`, which is the instruction generated by `is_floating_point_type(bt)`:
>
>     if (is_floating_point_type(bt)) {
>       __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg));
>     }
>
> After we implement these nodes, by using `-XX:+UseRVV`, the number of assembly instructions is reduced by about ~50% because of the different execution paths with the number of loops, similar to `AddTest` [3].
>
> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
> [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java
> [3] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md
>
> Please take a look and have some reviews. Thanks a lot.
>
> ## Testing:
>
> - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU)
> - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU)

Can you also run the whole tier2 please?

src/hotspot/cpu/riscv/riscv_v.ad line 2088:

> 2086: BasicType bt = Matcher::vector_element_basic_type(this);
> 2087: Assembler::SEW sew = Assembler::elemtype_to_sew(bt);
> 2088: __ vsetvli(t0, x0, sew);

I heard this opcode (vsetvli) is pretty costly when the parameters of the vector engine get reconfigured (for example for a different element width). Not saying anything bad here. We might need to think about some optimisations for using vsetvli in the future.

-------------

PR: https://git.openjdk.org/jdk/pull/11344

From rrich at openjdk.org Fri Nov 25 11:13:28 2022
From: rrich at openjdk.org (Richard Reingruber)
Date: Fri, 25 Nov 2022 11:13:28 GMT
Subject: RFR: 8286302: Port JEP 425 to PPC64 [v11]
In-Reply-To: 
References: 
Message-ID: 

> Hi,
>
> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425) to PPC64. More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisite for 'real' virtual threads (oxymoron?).
>
> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added to or subtracted from a frame address or size.
>
> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`.
>
>     X86 / AARCH64                        PPC64:
>
>     :                 :                  :                 :
>     :                 :                  :                 :
>     |                 |                  |                 |
>     |-----------------|                  |-----------------|
>     |                 |                  |                 |
>     | stack arguments |                  | stack arguments |
>     |                 |<- callers_SP     |                 |
>     ===================                  |-----------------|
>     |                 |                  |                 |
>     | metadata at bottom |               | metadata at top |
>     |                 |                  |                 |<- callers_SP
>     |-----------------|                  ===================
>     |                 |                  |                 |
>     |                 |                  |                 |
>     |                 |                  |                 |
>     |                 |                  |                 |
>     |                 |<- SP             |                 |
>     ===================                  |-----------------|
>                                          |                 |
>                                          | metadata at top |
>                                          |                 |<- SP
>                                          ===================
>
> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations.
> Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`.
>
> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`:
>   `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top`
>
> * address of stack arguments:
>   `callers_SP + frame::metadata_words_at_top`
>
> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words.
>
> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter.
>
> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know whether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as I found it very useful; it increases the runtime quite a bit though.
>
> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64.
>
> Thanks, Richard.

Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:

  Rename test BasicExp -> BasicExt

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/10961/files
  - new: https://git.openjdk.org/jdk/pull/10961/files/0b7e325b..817ec63b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=10
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10961&range=09-10

Stats: 11 lines in 1 file changed: 0 ins; 0 del; 11 mod
Patch: https://git.openjdk.org/jdk/pull/10961.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10961/head:pull/10961

PR: https://git.openjdk.org/jdk/pull/10961

From rrich at openjdk.org Fri Nov 25 11:13:31 2022
From: rrich at openjdk.org (Richard Reingruber)
Date: Fri, 25 Nov 2022 11:13:31 GMT
Subject: RFR: 8286302: Port JEP 425 to PPC64 [v10]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 24 Nov 2022 13:56:06 GMT, Richard Reingruber wrote:

>> Hi,
>>
>> this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425) to PPC64. More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisite for 'real' virtual threads (oxymoron?).
>>
>> Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added to or subtracted from a frame address or size.
>>
>> The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`.
The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. >> >> >> X86 / AARCH64 PPC64: >> >> : : : : >> : : : : >> | | | | >> |-----------------| |-----------------| >> | | | | >> | stack arguments | | stack arguments | >> | |<- callers_SP | | >> =================== |-----------------| >> | | | | >> | metadata at bottom | | metadata at top | >> | | | |<- callers_SP >> |-----------------| =================== >> | | | | >> | | | | >> | | | | >> | | | | >> | |<- SP | | >> =================== |-----------------| >> | | >> | metadata at top | >> | |<- SP >> =================== >> >> >> On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. >> >> * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: >> `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` >> >> * address of stack arguments: >> `callers_SP + frame::metadata_words_at_top` >> >> * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. >> >> Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. >> >> The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. >> >> Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. >> >> Thanks, Richard. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 > - More Feedback Leonid > - Feedback Leonid > - Cleanup BasicExp test > - Feedback Martin > - Cleanup BasicExp.java > - Feedback from backwaterred > - Fix cpp condition and add PPC64 > - Changes lost in merge > - Merge branch 'master' into 8286302_Port_JEP_425_to_PPC64 > - ... and 2 more: https://git.openjdk.org/jdk/compare/9c77e41b...0b7e325b Thanks a lot for the reviews and feedback! My own testing and GHA have succeeded after merging master. 
I'm intending to integrate the port on Monday since I don't expect more feedback. ------------- PR: https://git.openjdk.org/jdk/pull/10961 From tschatzl at openjdk.org Fri Nov 25 11:19:19 2022 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 25 Nov 2022 11:19:19 GMT Subject: RFR: 8297487: G1 Remark: no need to keep alive oop constants of nmethods on stack In-Reply-To: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> References: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> Message-ID: On Wed, 23 Nov 2022 10:05:56 GMT, Richard Reingruber wrote: > This pr removes the stackwalks to keep alive oops of nmethods found on stack during G1 remark as it seems redundant. The oops are already kept alive by the [nmethod entry barrier](https://github.com/openjdk/jdk/blob/f26bd4e0e8b68de297a9ff93526cd7fac8668320/src/hotspot/share/gc/shared/barrierSetNMethod.cpp#L85) > > Additionally it fixes a comment that says nmethod entry barriers are needed to deal with continuations which, afaik, is not the case. Please correct me and explain if I'm mistaken. > > Testing: the patch is included in our daily CI testing since a week. That is most JCK and JTREG tests, also in Xcomp mode, Renaissance benchmark and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. There was no failure I could attribute to this change. > > I tried to find a jtreg test that is sensitive to the keep alive by omitting it in the nmethod entry barrier and also in G1 remark but without success. Lgtm. Will push it through our testing. ------------- Marked as reviewed by tschatzl (Reviewer). PR: https://git.openjdk.org/jdk/pull/11314 From dzhang at openjdk.org Fri Nov 25 13:10:14 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 25 Nov 2022 13:10:14 GMT Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 10:21:42 GMT, Vladimir Kempik wrote: > Can you also run whole tier2 please ? Thank you for your suggestion, of course! I will add the results of whole tier2 later. > src/hotspot/cpu/riscv/riscv_v.ad line 2088: > >> 2086: BasicType bt = Matcher::vector_element_basic_type(this); >> 2087: Assembler::SEW sew = Assembler::elemtype_to_sew(bt); >> 2088: __ vsetvli(t0, x0, sew); > > I heard this opcode ( vsetvli) is pretty costly when the params of vector engine gets reconfigured ( for example for different element width). Not saying anything bad here. We might need to think about some optimisations for using vsetvli in future Hi @VladimirKempik , thanks for the review! Almost every instruct in `riscv_v.ad ` uses this opcode ( vsetvli) at the beginning, and it does look like there is a need for optimization. Maybe we can probably discuss it more extensively and change it uniformly. ------------- PR: https://git.openjdk.org/jdk/pull/11344 From sgehwolf at openjdk.org Fri Nov 25 13:43:07 2022 From: sgehwolf at openjdk.org (Severin Gehwolf) Date: Fri, 25 Nov 2022 13:43:07 GMT Subject: RFR: 8297590: [TESTBUG] HotSpotResolvedJavaFieldTest does not run In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 16:33:00 GMT, Severin Gehwolf wrote: > Simple test fix. Test now also runs when running the JVMCI test set. > > Testing: jvmci tests. Includes the test now and passes. @dougxc Hi! Could you help review this, please? Thanks! 
------------- PR: https://git.openjdk.org/jdk/pull/11358 From rrich at openjdk.org Fri Nov 25 13:55:16 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 25 Nov 2022 13:55:16 GMT Subject: RFR: 8297487: G1 Remark: no need to keep alive oop constants of nmethods on stack In-Reply-To: References: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> Message-ID: On Fri, 25 Nov 2022 11:16:43 GMT, Thomas Schatzl wrote: > Lgtm. Will push it through our testing. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/11314 From chagedorn at openjdk.org Fri Nov 25 15:13:46 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 25 Nov 2022 15:13:46 GMT Subject: RFR: 8297590: [TESTBUG] HotSpotResolvedJavaFieldTest does not run In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 16:33:00 GMT, Severin Gehwolf wrote: > Simple test fix. Test now also runs when running the JVMCI test set. > > Testing: jvmci tests. Includes the test now and passes. Looks good to me! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11358 From ihse at openjdk.org Fri Nov 25 15:20:25 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Fri, 25 Nov 2022 15:20:25 GMT Subject: RFR: 8297644: RISC-V: Compilation error when shenandoah is disabled Message-ID: If configuring with `--disable-jvm-feature-shenandoahgc`, the risc-v port fails to build. It seems that the code is really dependent on two header files, that is not declared, and probably has "leaked in" somewhere, but only if shenandoah is enabled. I have tried to resolve it to the best of my knowledge, but if you're not happy with the solution, by all means suggest a better way or take over this bug. ------------- Commit messages: - 8297644: [riscv] Compilation error when shenandoah is disabled Changes: https://git.openjdk.org/jdk/pull/11370/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11370&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297644 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11370.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11370/head:pull/11370 PR: https://git.openjdk.org/jdk/pull/11370 From tschatzl at openjdk.org Fri Nov 25 15:23:12 2022 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 25 Nov 2022 15:23:12 GMT Subject: RFR: 8297487: G1 Remark: no need to keep alive oop constants of nmethods on stack In-Reply-To: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> References: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> Message-ID: <-F1WM-axVDffLhcQk_OrMvjoUwFr_K7LjHsx1kvV-UU=.8d3f4fb2-4733-4950-9d9b-306373759636@github.com> On Wed, 23 Nov 2022 10:05:56 GMT, Richard Reingruber wrote: > This pr removes the stackwalks to keep alive oops of nmethods found on stack during G1 remark as it seems redundant. The oops are already kept alive by the [nmethod entry barrier](https://github.com/openjdk/jdk/blob/f26bd4e0e8b68de297a9ff93526cd7fac8668320/src/hotspot/share/gc/shared/barrierSetNMethod.cpp#L85) > > Additionally it fixes a comment that says nmethod entry barriers are needed to deal with continuations which, afaik, is not the case. Please correct me and explain if I'm mistaken. > > Testing: the patch is included in our daily CI testing since a week. 
That is most JCK and JTREG tests, also in Xcomp mode, Renaissance benchmark and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. There was no failure I could attribute to this change. > > I tried to find a jtreg test that is sensitive to the keep alive by omitting it in the nmethod entry barrier and also in G1 remark but without success. One other note: since the ARM32 port does not have this nmethod walk safety net during Remark pause anymore and it does not implement nmethod barriers it may be good to talk to ARM32 maintainers about this change. I do not know who maintains ARM32 code. We at Oracle do not support ARM32 so it should be good, but it may help ARM32 maintainers to keep this now removed code after all. ------------- PR: https://git.openjdk.org/jdk/pull/11314 From sgehwolf at openjdk.org Fri Nov 25 15:33:07 2022 From: sgehwolf at openjdk.org (Severin Gehwolf) Date: Fri, 25 Nov 2022 15:33:07 GMT Subject: RFR: 8297590: [TESTBUG] HotSpotResolvedJavaFieldTest does not run In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 15:10:28 GMT, Christian Hagedorn wrote: >> Simple test fix. Test now also runs when running the JVMCI test set. >> >> Testing: jvmci tests. Includes the test now and passes. > > Looks good to me! Thanks for the review @chhagedorn. ------------- PR: https://git.openjdk.org/jdk/pull/11358 From roland at openjdk.org Fri Nov 25 15:44:44 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 25 Nov 2022 15:44:44 GMT Subject: RFR: 8297343: TestStress*.java fail with "got different traces for the same seed" [v2] In-Reply-To: References: Message-ID: > Root cause from Roberto's analysis: > > "The regression seems to be due to the introduction of non-determinism > in the node dumps of otherwise identical compilations." > > "The problem seems to be that JDK-6312651 dumps interface sets in an > order that is determined by the raw pointers of the set elements. This > is unstable across different runs and leads to different node dumps > for otherwise identical compilations." > > "Stable node dumps are useful for debugging (e.g. when diffing > compiler traces from two different runs), so the solution is probably > dumping interface sets in some order (e.g. lexicographic order of each > interface dump) that does not depend on raw pointer values." > > This patch implements Roberto's recommendation by sorting interfaces > on their ciBaseObject::_ident. 
Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: - Update src/hotspot/share/opto/type.cpp Co-authored-by: Tobias Hartmann - Update src/hotspot/share/opto/type.cpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11357/files - new: https://git.openjdk.org/jdk/pull/11357/files/6914074e..a4aea940 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11357&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11357&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11357.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11357/head:pull/11357 PR: https://git.openjdk.org/jdk/pull/11357 From roland at openjdk.org Fri Nov 25 15:58:03 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 25 Nov 2022 15:58:03 GMT Subject: RFR: 8297343: TestStress*.java fail with "got different traces for the same seed" [v3] In-Reply-To: References: Message-ID: > Root cause from Roberto's analysis: > > "The regression seems to be due to the introduction of non-determinism > in the node dumps of otherwise identical compilations." > > "The problem seems to be that JDK-6312651 dumps interface sets in an > order that is determined by the raw pointers of the set elements. This > is unstable across different runs and leads to different node dumps > for otherwise identical compilations." > > "Stable node dumps are useful for debugging (e.g. when diffing > compiler traces from two different runs), so the solution is probably > dumping interface sets in some order (e.g. lexicographic order of each > interface dump) that does not depend on raw pointer values." > > This patch implements Roberto's recommendation by sorting interfaces > on their ciBaseObject::_ident. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Revert "8297347: Problem list compiler/debug/TestStress*.java" This reverts commit 0b04a99245795c223a01d1cbe66a46d20e480c53. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11357/files - new: https://git.openjdk.org/jdk/pull/11357/files/a4aea940..1a9fbf3a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11357&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11357&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11357.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11357/head:pull/11357 PR: https://git.openjdk.org/jdk/pull/11357 From roland at openjdk.org Fri Nov 25 15:58:06 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 25 Nov 2022 15:58:06 GMT Subject: RFR: 8297343: TestStress*.java fail with "got different traces for the same seed" [v2] In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 15:44:44 GMT, Roland Westrelin wrote: >> Root cause from Roberto's analysis: >> >> "The regression seems to be due to the introduction of non-determinism >> in the node dumps of otherwise identical compilations." >> >> "The problem seems to be that JDK-6312651 dumps interface sets in an >> order that is determined by the raw pointers of the set elements. This >> is unstable across different runs and leads to different node dumps >> for otherwise identical compilations." >> >> "Stable node dumps are useful for debugging (e.g. when diffing >> compiler traces from two different runs), so the solution is probably >> dumping interface sets in some order (e.g. 
lexicographic order of each >> interface dump) that does not depend on raw pointer values." >> >> This patch implements Roberto's recommendation by sorting interfaces >> on their ciBaseObject::_ident. > > Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: > > - Update src/hotspot/share/opto/type.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/type.cpp > > Co-authored-by: Tobias Hartmann Thanks for reviewing this. > You also need to remove the tests from the problem list. Done. ------------- PR: https://git.openjdk.org/jdk/pull/11357 From rrich at openjdk.org Fri Nov 25 16:15:17 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 25 Nov 2022 16:15:17 GMT Subject: RFR: 8297487: G1 Remark: no need to keep alive oop constants of nmethods on stack In-Reply-To: <-F1WM-axVDffLhcQk_OrMvjoUwFr_K7LjHsx1kvV-UU=.8d3f4fb2-4733-4950-9d9b-306373759636@github.com> References: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> <-F1WM-axVDffLhcQk_OrMvjoUwFr_K7LjHsx1kvV-UU=.8d3f4fb2-4733-4950-9d9b-306373759636@github.com> Message-ID: On Fri, 25 Nov 2022 15:21:08 GMT, Thomas Schatzl wrote: > One other note: since the ARM32 port does not have this nmethod walk safety net during Remark pause anymore and it does not implement nmethod barriers (at least that's what that `nullptr` return value by that removed comment indicates to me) it may be good to talk to ARM32 maintainers about this change. I do not know who maintains ARM32 code. > > We at Oracle do not support ARM32 so it should be good, but it may help ARM32 maintainers to keep this now removed code after all. Good point. It is myy understanding (also stated in the JBS item) that G1 concurrent marking requires the keep alive of oop constants by the nmethod entry barriers for SATB correctness. So without the entry barriers ARM32 has an issue there already because the keep alive during the remark pause is not sufficient, is it? There's a JBS item [JDK-8291302](https://bugs.openjdk.org/browse/JDK-8291302) to implement the nmethod entry barrier assigned to @bulasevich. Boris are there plans to implement the nmethod entry barriers on ARM32 in the near future given that they are required for G1 correctness? ------------- PR: https://git.openjdk.org/jdk/pull/11314 From dnsimon at openjdk.org Fri Nov 25 16:26:11 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 25 Nov 2022 16:26:11 GMT Subject: RFR: 8297590: [TESTBUG] HotSpotResolvedJavaFieldTest does not run In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 16:33:00 GMT, Severin Gehwolf wrote: > Simple test fix. Test now also runs when running the JVMCI test set. > > Testing: jvmci tests. Includes the test now and passes. Marked as reviewed by dnsimon (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/11358 From sgehwolf at openjdk.org Fri Nov 25 16:34:11 2022 From: sgehwolf at openjdk.org (Severin Gehwolf) Date: Fri, 25 Nov 2022 16:34:11 GMT Subject: Integrated: 8297590: [TESTBUG] HotSpotResolvedJavaFieldTest does not run In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 16:33:00 GMT, Severin Gehwolf wrote: > Simple test fix. Test now also runs when running the JVMCI test set. > > Testing: jvmci tests. Includes the test now and passes. This pull request has now been integrated. 
Changeset: 08e6a820 Author: Severin Gehwolf URL: https://git.openjdk.org/jdk/commit/08e6a820bcb70e74a0faa28198493292e2993901 Stats: 10 lines in 1 file changed: 8 ins; 0 del; 2 mod 8297590: [TESTBUG] HotSpotResolvedJavaFieldTest does not run Reviewed-by: chagedorn, dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/11358 From dnsimon at openjdk.org Fri Nov 25 17:43:24 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 25 Nov 2022 17:43:24 GMT Subject: Integrated: 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 14:30:01 GMT, Doug Simon wrote: > JVMCI has a mechanism for translating exceptions from libjvmci to HotSpot and vice versa. This is important for proper error handling when a thread calls between these 2 runtime heaps. > > This translation mechanism itself needs to be robust in the context of resource limits, especially heap limits, as it may be translating an OutOfMemoryError from HotSpot back into libjvmci. The existing code in [`HotSpotJVMCIRuntime.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java#L237) and [`TranslatedException.encodeThrowable`](https://github.com/graalvm/labs-openjdk-17/blob/f6b18b596fa5acb1ab7efa10e284d106669040a6/src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/TranslatedException.java#L153) is designed to handle translation failures by falling back to non-allocating code. However, we still occasionally see [an OOME that breaks the translation mechanism](https://github.com/oracle/graal/issues/5470#issuecomment-1321749688). One speculated possibility for this is an OOME re-materializing oops du ring a deoptimization causing an unexpected execution path. This PR increases the robustness of the exception translation code in light of such issues. This pull request has now been integrated. Changeset: 952e1005 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/952e10055135613e8ea2b818a4f35842936f5633 Stats: 47 lines in 3 files changed: 34 ins; 2 del; 11 mod 8297431: [JVMCI] HotSpotJVMCIRuntime.encodeThrowable should not throw an exception Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/11286 From tschatzl at openjdk.org Fri Nov 25 19:29:07 2022 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 25 Nov 2022 19:29:07 GMT Subject: RFR: 8297487: G1 Remark: no need to keep alive oop constants of nmethods on stack In-Reply-To: References: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> <-F1WM-axVDffLhcQk_OrMvjoUwFr_K7LjHsx1kvV-UU=.8d3f4fb2-4733-4950-9d9b-306373759636@github.com> Message-ID: On Fri, 25 Nov 2022 16:12:53 GMT, Richard Reingruber wrote: > > We at Oracle do not support ARM32 so it should be good, but it may help ARM32 maintainers to keep this now removed code after all. > > Good point. It is myy understanding (also stated in the JBS item) that G1 concurrent marking requires the keep alive of oop constants by the nmethod entry barriers for SATB correctness. So without the entry barriers ARM32 has an issue there already because the keep alive during the remark pause is not sufficient, is it? 
That is true, but it might make the problem larger than necessary - although admittedly, the comments indicate some magic hand-waving of the effectiveness of this additional walk through the nmethods on thread stacks. Imho G1 has worked well enough with that level of wrongness for a long time on the other platforms, so keeping this (little amount of) code may help ARM32 maintainers to tide over a little bit (i.e. not make their platforms potentially crash left and right) until they are ready with their nmethod barrier implementation.

I'm good either way you choose.

-------------

PR: https://git.openjdk.org/jdk/pull/11314

From duke at openjdk.org Fri Nov 25 22:06:19 2022
From: duke at openjdk.org (duke)
Date: Fri, 25 Nov 2022 22:06:19 GMT
Subject: Withdrawn: 8291801: IGV: Broken button "Hide graphs which are the same as the previous graph"
In-Reply-To: 
References: 
Message-ID: 

On Tue, 27 Sep 2022 08:09:02 GMT, Koichi Sakata wrote:

> This pull request makes the "Hide graphs..." button work.
>
> The parser for bgv files, the `BinaryParser` class, enables the "hide graphs" function. But the xml parser, the `Parser` class, doesn't have any code for the function. So I wrote code by referring to the BinaryParser class.
>
> # Tests
>
> I tested manually. Screenshots are as follows.
>
> [screenshot 2022-09-27 16 52 16]
>
> In this case, there are 13 graphs and 10 graphs from "After Parsing" to "Before matching" are the same. So only 4 graphs are shown after we push the button.
>
> [screenshot 2022-09-27 16 52 23]
>
> Pushing it again or opening the graph that is hidden in the view restores the view to its original state. I attached [the graph file that I used in the test](https://github.com/openjdk/jdk/files/9653179/hello.zip).
>
> Furthermore, I added a test method for the class I changed. Running the `mvn test` command was successful.
>
> # Concerns
>
> I'm concerned about the equivalence of 2 graphs. I regard them as the same graphs when all fields in the `InputGraph` class are equal. But in the test class, 2 graphs are equal when the nodes, edges, blocks and the block of each `InputNode` are equal. Here is an extract of the test code.
>
>     // src/utils/IdealGraphVisualizer/Data/src/test/java/com/sun/hotspot/igv/data/Util.java
>
>     public static void assertGraphEquals(InputGraph a, InputGraph b) {
>
>         if (!a.getNodesAsSet().equals(b.getNodesAsSet())) {
>             fail();
>         }
>
>         if (!a.getEdges().equals(b.getEdges())) {
>             fail();
>         }
>
>         if (!a.getBlocks().equals(b.getBlocks())) {
>             fail();
>         }
>
>         for (InputNode n : a.getNodes()) {
>             assertEquals(a.getBlock(n), b.getBlock(n));
>         }
>     }
>
> But opening a graph is very slow when I compare the blocks of each InputNode. So I didn't add that comparison in the `isSameContent` method.

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/10440

From fyang at openjdk.org Sat Nov 26 13:10:09 2022
From: fyang at openjdk.org (Fei Yang)
Date: Sat, 26 Nov 2022 13:10:09 GMT
Subject: RFR: 8297644: RISC-V: Compilation error when shenandoah is disabled
In-Reply-To: 
References: 
Message-ID: 

On Fri, 25 Nov 2022 15:12:01 GMT, Magnus Ihse Bursie wrote:

> If configuring with `--disable-jvm-feature-shenandoahgc`, the risc-v port fails to build.
>
> It seems that the code is really dependent on two header files, that is not declared, and probably has "leaked in" somewhere, but only if shenandoah is enabled.
I have tried to resolve it to the best of my knowledge, but if you're not happy with the solution, by all means suggest a better way or take over this bug. Looks good. Also verified by doing a native fastdebug build with --disable-jvm-feature-shenandoahgc. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.org/jdk/pull/11370 From vkempik at openjdk.org Sat Nov 26 16:20:11 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Sat, 26 Nov 2022 16:20:11 GMT Subject: RFR: 8297359: RISC-V: improve performance of floating Max Min intrinsics In-Reply-To: References: Message-ID: <9hOqv0ROTE2uklCoTHx36e9VJBk2myCIwW_DWKx6SnI=.a931fe5b-ac32-405a-a763-71df19a93f8a@github.com> On Wed, 23 Nov 2022 15:20:47 GMT, Vladimir Kempik wrote: > Please review this change. > > It improves performance of Math.min/max intrinsics for Floats and Doubles. > > The main issue in these intrinsics is the requirement to return NaN if any of arguments is NaN. In risc-v, fmin/fmax returns NaN only if both of src registers are NaN ( quiet NaN). > That requires additional logic to handle the case where only of of src is NaN. > > Here the postcheck with flt (floating less than comparision) and flags analysis replaced with precheck. The precheck is done with 2 fclass on both src then checking combined ( by or-in) result, if one of src is NaN then put the NaN into dst ( using fadd dst, src1, src2). > > Microbench results: > > The results on the thead c910: > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 53752.831 ? 97.198 ns/op > FpMinMaxIntrinsics.dMin avgt 25 53707.229 ? 177.559 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 42805.985 ? 9.901 ns/op > FpMinMaxIntrinsics.fMax avgt 25 53449.568 ? 215.294 ns/op > FpMinMaxIntrinsics.fMin avgt 25 53504.106 ? 180.833 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 42794.579 ? 7.013 ns/op > MaxMinOptimizeTest.dAdd avgt 25 381.138 ? 5.692 us/op > MaxMinOptimizeTest.dMax avgt 25 4575.094 ? 17.065 us/op > MaxMinOptimizeTest.dMin avgt 25 4584.648 ? 18.561 us/op > MaxMinOptimizeTest.dMul avgt 25 384.615 ? 7.751 us/op > MaxMinOptimizeTest.fAdd avgt 25 318.076 ? 3.308 us/op > MaxMinOptimizeTest.fMax avgt 25 4405.724 ? 20.353 us/op > MaxMinOptimizeTest.fMin avgt 25 4421.652 ? 18.029 us/op > MaxMinOptimizeTest.fMul avgt 25 305.462 ? 19.437 us/op > > after > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 10712.246 ? 5.607 ns/op > FpMinMaxIntrinsics.dMin avgt 25 10732.655 ? 41.894 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 3248.106 ? 2.143 ns/op > FpMinMaxIntrinsics.fMax avgt 25 10707.084 ? 3.276 ns/op > FpMinMaxIntrinsics.fMin avgt 25 10719.771 ? 14.864 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 3274.775 ? 0.996 ns/op > MaxMinOptimizeTest.dAdd avgt 25 383.720 ? 8.849 us/op > MaxMinOptimizeTest.dMax avgt 25 429.345 ? 11.160 us/op > MaxMinOptimizeTest.dMin avgt 25 439.980 ? 3.757 us/op > MaxMinOptimizeTest.dMul avgt 25 390.126 ? 10.258 us/op > MaxMinOptimizeTest.fAdd avgt 25 300.005 ? 18.206 us/op > MaxMinOptimizeTest.fMax avgt 25 370.467 ? 6.054 us/op > MaxMinOptimizeTest.fMin avgt 25 375.134 ? 4.568 us/op > MaxMinOptimizeTest.fMul avgt 25 305.344 ? 18.307 us/op > > hifive umatched > > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 30234.224 ? 16.744 ns/op > FpMinMaxIntrinsics.dMin avgt 25 30227.686 ? 15.389 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 15766.749 ? 3.724 ns/op > FpMinMaxIntrinsics.fMax avgt 25 30140.092 ? 
10.243 ns/op > FpMinMaxIntrinsics.fMin avgt 25 30149.470 ? 34.041 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 15760.770 ? 5.415 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1155.234 ? 4.603 us/op > MaxMinOptimizeTest.dMax avgt 25 2597.897 ? 3.307 us/op > MaxMinOptimizeTest.dMin avgt 25 2599.183 ? 3.806 us/op > MaxMinOptimizeTest.dMul avgt 25 1155.281 ? 1.813 us/op > MaxMinOptimizeTest.fAdd avgt 25 750.967 ? 7.254 us/op > MaxMinOptimizeTest.fMax avgt 25 2305.085 ? 1.556 us/op > MaxMinOptimizeTest.fMin avgt 25 2305.306 ? 1.478 us/op > MaxMinOptimizeTest.fMul avgt 25 750.623 ? 7.357 us/op > > 2fclass_new > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 23599.547 ? 29.571 ns/op > FpMinMaxIntrinsics.dMin avgt 25 23593.236 ? 18.456 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 8630.201 ? 1.353 ns/op > FpMinMaxIntrinsics.fMax avgt 25 23496.337 ? 18.340 ns/op > FpMinMaxIntrinsics.fMin avgt 25 23477.881 ? 8.545 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 8629.135 ? 0.869 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1155.479 ? 4.938 us/op > MaxMinOptimizeTest.dMax avgt 25 1560.323 ? 3.077 us/op > MaxMinOptimizeTest.dMin avgt 25 1558.668 ? 2.421 us/op > MaxMinOptimizeTest.dMul avgt 25 1154.919 ? 2.077 us/op > MaxMinOptimizeTest.fAdd avgt 25 751.325 ? 7.169 us/op > MaxMinOptimizeTest.fMax avgt 25 1306.131 ? 1.102 us/op > MaxMinOptimizeTest.fMin avgt 25 1306.134 ? 0.957 us/op > MaxMinOptimizeTest.fMul avgt 25 750.968 ? 7.334 us/op hotspot:tier3 is fine, so ------------- PR: https://git.openjdk.org/jdk/pull/11327 From vkempik at openjdk.org Sat Nov 26 16:20:12 2022 From: vkempik at openjdk.org (Vladimir Kempik) Date: Sat, 26 Nov 2022 16:20:12 GMT Subject: Integrated: 8297359: RISC-V: improve performance of floating Max Min intrinsics In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 15:20:47 GMT, Vladimir Kempik wrote: > Please review this change. > > It improves performance of Math.min/max intrinsics for Floats and Doubles. > > The main issue in these intrinsics is the requirement to return NaN if any of arguments is NaN. In risc-v, fmin/fmax returns NaN only if both of src registers are NaN ( quiet NaN). > That requires additional logic to handle the case where only of of src is NaN. > > Here the postcheck with flt (floating less than comparision) and flags analysis replaced with precheck. The precheck is done with 2 fclass on both src then checking combined ( by or-in) result, if one of src is NaN then put the NaN into dst ( using fadd dst, src1, src2). > > Microbench results: > > The results on the thead c910: > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 53752.831 ? 97.198 ns/op > FpMinMaxIntrinsics.dMin avgt 25 53707.229 ? 177.559 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 42805.985 ? 9.901 ns/op > FpMinMaxIntrinsics.fMax avgt 25 53449.568 ? 215.294 ns/op > FpMinMaxIntrinsics.fMin avgt 25 53504.106 ? 180.833 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 42794.579 ? 7.013 ns/op > MaxMinOptimizeTest.dAdd avgt 25 381.138 ? 5.692 us/op > MaxMinOptimizeTest.dMax avgt 25 4575.094 ? 17.065 us/op > MaxMinOptimizeTest.dMin avgt 25 4584.648 ? 18.561 us/op > MaxMinOptimizeTest.dMul avgt 25 384.615 ? 7.751 us/op > MaxMinOptimizeTest.fAdd avgt 25 318.076 ? 3.308 us/op > MaxMinOptimizeTest.fMax avgt 25 4405.724 ? 20.353 us/op > MaxMinOptimizeTest.fMin avgt 25 4421.652 ? 18.029 us/op > MaxMinOptimizeTest.fMul avgt 25 305.462 ? 19.437 us/op > > after > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 10712.246 ? 
5.607 ns/op > FpMinMaxIntrinsics.dMin avgt 25 10732.655 ? 41.894 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 3248.106 ? 2.143 ns/op > FpMinMaxIntrinsics.fMax avgt 25 10707.084 ? 3.276 ns/op > FpMinMaxIntrinsics.fMin avgt 25 10719.771 ? 14.864 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 3274.775 ? 0.996 ns/op > MaxMinOptimizeTest.dAdd avgt 25 383.720 ? 8.849 us/op > MaxMinOptimizeTest.dMax avgt 25 429.345 ? 11.160 us/op > MaxMinOptimizeTest.dMin avgt 25 439.980 ? 3.757 us/op > MaxMinOptimizeTest.dMul avgt 25 390.126 ? 10.258 us/op > MaxMinOptimizeTest.fAdd avgt 25 300.005 ? 18.206 us/op > MaxMinOptimizeTest.fMax avgt 25 370.467 ? 6.054 us/op > MaxMinOptimizeTest.fMin avgt 25 375.134 ? 4.568 us/op > MaxMinOptimizeTest.fMul avgt 25 305.344 ? 18.307 us/op > > hifive umatched > > before > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 30234.224 ? 16.744 ns/op > FpMinMaxIntrinsics.dMin avgt 25 30227.686 ? 15.389 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 15766.749 ? 3.724 ns/op > FpMinMaxIntrinsics.fMax avgt 25 30140.092 ? 10.243 ns/op > FpMinMaxIntrinsics.fMin avgt 25 30149.470 ? 34.041 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 15760.770 ? 5.415 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1155.234 ? 4.603 us/op > MaxMinOptimizeTest.dMax avgt 25 2597.897 ? 3.307 us/op > MaxMinOptimizeTest.dMin avgt 25 2599.183 ? 3.806 us/op > MaxMinOptimizeTest.dMul avgt 25 1155.281 ? 1.813 us/op > MaxMinOptimizeTest.fAdd avgt 25 750.967 ? 7.254 us/op > MaxMinOptimizeTest.fMax avgt 25 2305.085 ? 1.556 us/op > MaxMinOptimizeTest.fMin avgt 25 2305.306 ? 1.478 us/op > MaxMinOptimizeTest.fMul avgt 25 750.623 ? 7.357 us/op > > 2fclass_new > > Benchmark Mode Cnt Score Error Units > FpMinMaxIntrinsics.dMax avgt 25 23599.547 ? 29.571 ns/op > FpMinMaxIntrinsics.dMin avgt 25 23593.236 ? 18.456 ns/op > FpMinMaxIntrinsics.dMinReduce avgt 25 8630.201 ? 1.353 ns/op > FpMinMaxIntrinsics.fMax avgt 25 23496.337 ? 18.340 ns/op > FpMinMaxIntrinsics.fMin avgt 25 23477.881 ? 8.545 ns/op > FpMinMaxIntrinsics.fMinReduce avgt 25 8629.135 ? 0.869 ns/op > MaxMinOptimizeTest.dAdd avgt 25 1155.479 ? 4.938 us/op > MaxMinOptimizeTest.dMax avgt 25 1560.323 ? 3.077 us/op > MaxMinOptimizeTest.dMin avgt 25 1558.668 ? 2.421 us/op > MaxMinOptimizeTest.dMul avgt 25 1154.919 ? 2.077 us/op > MaxMinOptimizeTest.fAdd avgt 25 751.325 ? 7.169 us/op > MaxMinOptimizeTest.fMax avgt 25 1306.131 ? 1.102 us/op > MaxMinOptimizeTest.fMin avgt 25 1306.134 ? 0.957 us/op > MaxMinOptimizeTest.fMul avgt 25 750.968 ? 7.334 us/op This pull request has now been integrated. Changeset: 99d3840d Author: Vladimir Kempik URL: https://git.openjdk.org/jdk/commit/99d3840d368f1d99af72250678a2cb0c55ee0957 Stats: 33 lines in 2 files changed: 12 ins; 11 del; 10 mod 8297359: RISC-V: improve performance of floating Max Min intrinsics Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/11327 From fyang at openjdk.org Sun Nov 27 09:57:34 2022 From: fyang at openjdk.org (Fei Yang) Date: Sun, 27 Nov 2022 09:57:34 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 10:43:30 GMT, Roman Kennke wrote: >> Hi Roman, >> >> Sorry for the late response - It is the former. >> >>> Are you saying that in init_buffer(), we don't have any method_entry_barrier stubs, yet, and therefore we return 0 there? >> >> This one. That `output()->_stub_list._stubs` appears to me to be always zero, for nodes are not emitted yet. 
>> I confirmed that before this patch `_safepoint_poll_table` is `0` but `_entry_barrier_table` is a non-zero value, like on aarch64 it is `24`. Why it can work before is, I think, that it is within some margins of error. Like `code_req += MAX_inst_size; // ensure per-instruction margin`, RISC-V's generated code is more verbose, so it reproduces this. Simply adding a `4`, which is just one instruction size, to the new `stub_req` can make the build pass.
>>
>> But the zero value of `_stub_list._stubs` is not an expectant one, though - I am not quite sure the best way to fix that. The table sizes are in fact both 1 to me?
>>
>> Best,
>> Xiaolin
>
>> Hi Roman,
>>
>> Sorry for the late response - It is the former.
>>
>> > Are you saying that in init_buffer(), we don't have any method_entry_barrier stubs, yet, and therefore we return 0 there?
>>
>> This one. That `output()->_stub_list._stubs` appears to me to be always zero, for nodes are not emitted yet. I confirmed that before this patch `_safepoint_poll_table` is `0` but `_entry_barrier_table` is a non-zero value, like on aarch64 it is `24`. Why it can work before is, I think, that it is within some margins of error. Like `code_req += MAX_inst_size; // ensure per-instruction margin`, RISC-V's generated code is more verbose, so it reproduces this. Simply adding a `4`, which is just one instruction size, to the new `stub_req` can make the build pass.
>>
>> But the zero value of `_stub_list._stubs` is not an expectant one, though - I am not quite sure the best way to fix that. The table sizes are in fact both 1 to me?
>
> nmethod-barriers are not expected to be always present. It depends on the GC being active, if GC needs it, a (single) nmethod barrier is inserted. This is currently only the case for ZGC and Shenandoah. For all other GCs, nmethod barriers are not used, and thus no stubs should be emitted.

@rkennke @zhengxiaolinX : Looks like the issue mentioned by Xiaolin doesn't manifest itself when I am doing a tier1 test with release build on my linux-riscv64 platform. But I did reproduce the crash by doing a linux-riscv64 native fastdebug build. On how to reproduce this issue with current jdk master on linux-aarch64:

1. Apply the following small change which simply increases the size of the safepoint poll stub:

diff --git a/src/hotspot/cpu/aarch64/c2_safepointPollStubTable_aarch64.cpp b/src/hotspot/cpu/aarch64/c2_safepointPollStubTable_aarch64.cpp
index fb36406fbde..e306bfd52ec 100644
--- a/src/hotspot/cpu/aarch64/c2_safepointPollStubTable_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/c2_safepointPollStubTable_aarch64.cpp
@@ -42,5 +42,8 @@ void C2SafepointPollStubTable::emit_stub_impl(MacroAssembler& masm, C2SafepointP
   __ adr(rscratch1, safepoint_pc);
   __ str(rscratch1, Address(rthread, JavaThread::saved_exception_pc_offset()));
   __ far_jump(callback_addr);
+  for (int i = 0; i < 700; i++) {
+    __ nop();
+  }
 }
 #undef __

2. Do a native fastdebug build on linux-aarch64, and the crash message looks like:

* All command lines available in /home/realfyang/openjdk-jdk/build/linux-aarch64-server-fastdebug/make-support/failure-logs.
=== End of repeated output ===

No indication of failed target found.
HELP: Try searching the build log for '] Error'.
HELP: Run 'make doctor' to diagnose build problems.
make[1]: *** [/home/realfyang/openjdk-jdk/make/Init.gmk:320: main] Error 2 make: *** [/home/realfyang/openjdk-jdk/make/Init.gmk:186: images] Error 2 Building target 'images' in configuration 'linux-aarch64-server-fastdebug' # To suppress the following error report, specify this argument # after -XX: or in .hotspotrc: SuppressErrorAt=/codeBuffer.hpp:191 # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/home/realfyang/openjdk-jdk/src/hotspot/share/asm/codeBuffer.hpp:191), pid=3806845, tid=3806862 # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000ffff805ffa00 <= 0x0000ffff80600434 <= 0x0000ffff80600430 # # JRE version: OpenJDK Runtime Environment (20.0) (fastdebug build 20-internal-adhoc.realfyang.openjdk-jdk) # Java VM: OpenJDK 64-Bit Server VM (fastdebug 20-internal-adhoc.realfyang.openjdk-jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) # Problematic frame: # V [libjvm.so+0x44ef48] Instruction_aarch64::~Instruction_aarch64()+0xb0 # # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again # # An error report file with more information is saved as: # /home/realfyang/openjdk-jdk/make/hs_err_pid3806845.log # # Compiler replay data is saved as: # /home/realfyang/openjdk-jdk/make/replay_pid3806845.log # # If you would like to submit a bug report, please visit: # https://bugreport.java.com/bugreport/crash.jsp # ------------- PR: https://git.openjdk.org/jdk/pull/11188 From qamai at openjdk.org Sun Nov 27 19:42:06 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 27 Nov 2022 19:42:06 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v10] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. 
And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 28 commits: - Merge branch 'master' into unsignedDiv - add gtest - Merge branch 'master' into unsignedDiv - style, comments - limit tests - revert backend changes - whitespace, mistaken added - fast path for negative divisors - code styles - Merge branch 'master' into unsignedDiv - ... and 18 more: https://git.openjdk.org/jdk/compare/2f83b5c4...3c03f89b ------------- Changes: https://git.openjdk.org/jdk/pull/9947/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=09 Stats: 1737 lines in 16 files changed: 1391 ins; 220 del; 126 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Sun Nov 27 20:23:29 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 27 Nov 2022 20:23:29 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v11] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. 
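(For reference, the multiply-and-shift identity described above can be checked numerically with a small stand-alone Java program. The divisor d = 7, the shift m = 35 and the class name below are arbitrary choices for illustration, and BigInteger is used only to avoid the 64-bit overflow cases the description discusses; this is a sketch of the arithmetic, not the C2 idealisation itself.)

import java.math.BigInteger;

// Minimal numeric sanity check of the multiply-and-shift identity used for
// unsigned division by a constant. The divisor d = 7 and the sampled inputs
// are arbitrary; this is not the C2 transformation.
public class UnsignedDivMagicCheck {
    public static void main(String[] args) {
        int d = 7;                               // constant divisor (example)
        int m = 32 + 3;                          // 3 == ceil(log2(7))
        BigInteger c = BigInteger.ONE.shiftLeft(m)        // c = ceil(2^m / d)
                .add(BigInteger.valueOf(d - 1))
                .divide(BigInteger.valueOf(d));
        long[] samples = {0L, 1L, 6L, 7L, 8L, 12345L, 0x7fffffffL,
                          0x80000000L, 0xfffffffeL, 0xffffffffL};
        for (long x : samples) {
            long viaMagic = BigInteger.valueOf(x).multiply(c)
                    .shiftRight(m).longValueExact();      // floor(x * c / 2^m)
            long expected = Integer.toUnsignedLong(
                    Integer.divideUnsigned((int) x, d));  // floor(x / d)
            if (viaMagic != expected) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("floor(x / 7) == floor(x * c >> 35) for all samples");
    }
}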
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: fix build failure ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/3c03f89b..afe37a99 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=09-10 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From yadongwang at openjdk.org Mon Nov 28 01:33:55 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Mon, 28 Nov 2022 01:33:55 GMT Subject: RFR: 8297644: RISC-V: Compilation error when shenandoah is disabled In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 15:12:01 GMT, Magnus Ihse Bursie wrote: > If configuring with `--disable-jvm-feature-shenandoahgc`, the risc-v port fails to build. > > It seems that the code is really dependent on two header files, that is not declared, and probably has "leaked in" somewhere, but only if shenandoah is enabled. I have tried to resolve it to the best of my knowledge, but if you're not happy with the solution, by all means suggest a better way or take over this bug. lgtm ------------- Marked as reviewed by yadongwang (Author). PR: https://git.openjdk.org/jdk/pull/11370 From fyang at openjdk.org Mon Nov 28 01:39:46 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Nov 2022 01:39:46 GMT Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: <_3535nHcSZiWIf0sdAvX_9fu38a45ObTp8c73FGLTYk=.7bf50ded-b579-4544-bf32-5d886f6af817@github.com> On Fri, 25 Nov 2022 10:20:23 GMT, Vladimir Kempik wrote: >> The instruction which is matched `VectorLoadConst` will create index starting from 0 and incremented by 1. In detail, the instruction populates the destination vector by setting the first element to 0 and monotonically incrementing the value by 1 for each subsequent element. >> >> We can add support of `VectorLoadConst` for RISC-V by `vid.v` . It was implemented by referring to RVV v1.0 [1]. >> >> Tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. By adding the `-XX:+PrintAssembly -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing micro-benchmark `IndexVectorBenchmark` [2] , the compilation log is as follows: >> >> >> 0d2 B7: # out( B12 B8 ) <- in( B11 B6 ) Freq: 1 >> . >> . >> . >> 0ec vloadcon V3 # generate iota indices >> >> >> At the same time, the following assembly code will be generated when running the `intIndexVector` case: >> >> 0x00000040144294ac: .4byte 0x10072d7 >> 0x00000040144294b0: .4byte 0x5208a1d7 >> >> `0x10072d7/0x5208a1d7` are the machine code for `vsetvli/vid.v`. 
When running the `floatIndexVector` case, there will be one more instruction than `intIndexVector`: >> >> 0x000000401443cc9c: .4byte 0x10072d7 >> 0x000000401443cca0: .4byte 0x5208a157 >> 0x000000401443cca4: .4byte 0x4a219157 >> >> `0x4a219157` are the machine code for `vfcvt.f.x.v`, which is the instruction generated by `is_floating_point_type(bt)`: >> >> if (is_floating_point_type(bt)) { >> __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); >> } >> >> >> After we implement these nodes, by using `-XX:+UseRVV`, the number of assembly instructions is reduced by about ~50% because of the different execution paths with the number of loops, similar to `AddTest` [3]. >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java >> [3] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md >> >> Please take a look and have some reviews. Thanks a lot. >> >> ## Testing: >> >> - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) >> - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) > > src/hotspot/cpu/riscv/riscv_v.ad line 2088: > >> 2086: BasicType bt = Matcher::vector_element_basic_type(this); >> 2087: Assembler::SEW sew = Assembler::elemtype_to_sew(bt); >> 2088: __ vsetvli(t0, x0, sew); > > I heard this opcode ( vsetvli) is pretty costly when the params of vector engine gets reconfigured ( for example for different element width). Not saying anything bad here. We might need to think about some optimisations for using vsetvli in future > Hi @VladimirKempik , thanks for the review! Almost every instruct in `riscv_v.ad ` uses this opcode ( vsetvli) at the beginning, and it does look like there is a need for optimization. Maybe we can probably discuss it more extensively and change it uniformly. It's interesting to know how native compilers like GCC/LLVM eliminate such redundances. I guess they face similar issues when do auto-vectorization for different loops within the same function. ------------- PR: https://git.openjdk.org/jdk/pull/11344 From fyang at openjdk.org Mon Nov 28 01:39:48 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Nov 2022 01:39:48 GMT Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 05:40:12 GMT, Dingli Zhang wrote: > The instruction which is matched `VectorLoadConst` will create index starting from 0 and incremented by 1. In detail, the instruction populates the destination vector by setting the first element to 0 and monotonically incrementing the value by 1 for each subsequent element. > > We can add support of `VectorLoadConst` for RISC-V by `vid.v` . It was implemented by referring to RVV v1.0 [1]. > > Tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. By adding the `-XX:+PrintAssembly -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameter when executing micro-benchmark `IndexVectorBenchmark` [2] , the compilation log is as follows: > > > 0d2 B7: # out( B12 B8 ) <- in( B11 B6 ) Freq: 1 > . > . > . 
> 0ec vloadcon V3 # generate iota indices > > > At the same time, the following assembly code will be generated when running the `intIndexVector` case: > > 0x00000040144294ac: .4byte 0x10072d7 > 0x00000040144294b0: .4byte 0x5208a1d7 > > `0x10072d7/0x5208a1d7` are the machine code for `vsetvli/vid.v`. When running the `floatIndexVector` case, there will be one more instruction than `intIndexVector`: > > 0x000000401443cc9c: .4byte 0x10072d7 > 0x000000401443cca0: .4byte 0x5208a157 > 0x000000401443cca4: .4byte 0x4a219157 > > `0x4a219157` are the machine code for `vfcvt.f.x.v`, which is the instruction generated by `is_floating_point_type(bt)`: > > if (is_floating_point_type(bt)) { > __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); > } > > > After we implement these nodes, by using `-XX:+UseRVV`, the number of assembly instructions is reduced by about ~50% because of the different execution paths with the number of loops, similar to `AddTest` [3]. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java > [3] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > > - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) > - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) src/hotspot/cpu/riscv/riscv_v.ad line 2091: > 2089: __ vid_v(as_VectorRegister($dst$$reg)); > 2090: if (is_floating_point_type(bt)) { > 2091: __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); You might want to distinugish between float and double for 'bt' here. Since vfcvt.f.x.v only convert signed integer to float. ------------- PR: https://git.openjdk.org/jdk/pull/11344 From njian at openjdk.org Mon Nov 28 01:42:06 2022 From: njian at openjdk.org (Ningsheng Jian) Date: Mon, 28 Nov 2022 01:42:06 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v5] In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 15:56:08 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. 
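(For reference, the Java-level shape that exposes such back-to-back XORs to the auto-vectorizer can be written as a small stand-alone loop. The class name, array names, length and warm-up count below are arbitrary; this sketch is not the JMH micro benchmark included in the patch.)

// Stand-alone sketch of the loop shape that produces consecutive XOR nodes.
public class Eor3Pattern {
    static final int LEN = 1024;

    static void xor3(long[] a, long[] b, long[] c, long[] r) {
        for (int i = 0; i < LEN; i++) {
            // (a ^ b) ^ c: two XOR nodes per element before the rule; on
            // SHA3-capable CPUs the new backend rule can fuse the vectorized
            // form into a single eor3 per vector.
            r[i] = (a[i] ^ b[i]) ^ c[i];
        }
    }

    public static void main(String[] args) {
        long[] a = new long[LEN], b = new long[LEN], c = new long[LEN], r = new long[LEN];
        for (int i = 0; i < LEN; i++) {
            a[i] = i;
            b[i] = 31L * i;
            c[i] = ~i;
        }
        for (int iter = 0; iter < 20_000; iter++) {  // warm up so C2 compiles xor3
            xor3(a, b, c, r);
        }
        System.out.println(r[LEN - 1]);
    }
}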
> > Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Resolve merge conflicts with master > - Merge branch 'master' into JDK-8293488 > - Removed svesha3 feature check for eor3 > - Changed the modifier order preference in JTREG test > - Modified JTREG test to include feature constraints > - 8293488: Add EOR3 backend rule for aarch64 SHA3 extension > > Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those > SHA3 instructions - "eor3" performs an exclusive OR of three vectors. > This is helpful in applications that have multiple, consecutive "eor" > operations which can be reduced by clubbing them into fewer operations > using the "eor3" instruction. For example - > eor a, a, b > eor a, a, c > can be optimized to single instruction - eor3 a, b, c > > This patch adds backend rules for Neon and SVE2 "eor3" instructions and > a micro benchmark to assess the performance gains with this patch. > Following are the results of the included micro benchmark on a 128-bit > aarch64 machine that supports Neon, SVE2 and SHA3 features - > > Benchmark gain > TestEor3.test1Int 10.87% > TestEor3.test1Long 8.84% > TestEor3.test2Int 21.68% > TestEor3.test2Long 21.04% > > The numbers shown are performance gains with using Neon eor3 instruction > over the master branch that uses multiple "eor" instructions instead. > Similar gains can be observed with the SVE2 "eor3" version as well since > the "eor3" instruction is unpredicated and the machine under test uses a > maximum vector width of 128 bits which makes the SVE2 code generation very > similar to the one with Neon. Marked as reviewed by njian (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/10407 From haosun at openjdk.org Mon Nov 28 02:39:11 2022 From: haosun at openjdk.org (Hao Sun) Date: Mon, 28 Nov 2022 02:39:11 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long Message-ID: x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch. Note-1: minor style issues are fixed for CmpL3 related rules. Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case. Testing: tier1~3 passed on Linux/AArch64 platform with no new failures. Following is the performance data for the JMH case: Before After Benchmark (size) Mode Cnt Score Error Score Error Units Integers.compareUnsignedDirect 500 avgt 5 0.994 ? 0.001 0.872 ? 0.015 us/op Integers.compareUnsignedIndirect 500 avgt 5 0.991 ? 0.001 0.833 ? 0.055 us/op Longs.compareUnsignedDirect 500 avgt 5 1.052 ? 0.001 0.974 ? 0.057 us/op Longs.compareUnsignedIndirect 500 avgt 5 1.053 ? 0.001 0.916 ? 
0.038 us/op ------------- Commit messages: - 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long Changes: https://git.openjdk.org/jdk/pull/11383/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11383&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287925 Stats: 123 lines in 2 files changed: 110 ins; 1 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/11383.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11383/head:pull/11383 PR: https://git.openjdk.org/jdk/pull/11383 From thartmann at openjdk.org Mon Nov 28 06:17:52 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 28 Nov 2022 06:17:52 GMT Subject: RFR: 8297343: TestStress*.java fail with "got different traces for the same seed" [v3] In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 15:58:03 GMT, Roland Westrelin wrote: >> Root cause from Roberto's analysis: >> >> "The regression seems to be due to the introduction of non-determinism >> in the node dumps of otherwise identical compilations." >> >> "The problem seems to be that JDK-6312651 dumps interface sets in an >> order that is determined by the raw pointers of the set elements. This >> is unstable across different runs and leads to different node dumps >> for otherwise identical compilations." >> >> "Stable node dumps are useful for debugging (e.g. when diffing >> compiler traces from two different runs), so the solution is probably >> dumping interface sets in some order (e.g. lexicographic order of each >> interface dump) that does not depend on raw pointer values." >> >> This patch implements Roberto's recommendation by sorting interfaces >> on their ciBaseObject::_ident. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Revert "8297347: Problem list compiler/debug/TestStress*.java" > > This reverts commit 0b04a99245795c223a01d1cbe66a46d20e480c53. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11357 From chagedorn at openjdk.org Mon Nov 28 07:11:12 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 28 Nov 2022 07:11:12 GMT Subject: RFR: 8297343: TestStress*.java fail with "got different traces for the same seed" [v3] In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 15:58:03 GMT, Roland Westrelin wrote: >> Root cause from Roberto's analysis: >> >> "The regression seems to be due to the introduction of non-determinism >> in the node dumps of otherwise identical compilations." >> >> "The problem seems to be that JDK-6312651 dumps interface sets in an >> order that is determined by the raw pointers of the set elements. This >> is unstable across different runs and leads to different node dumps >> for otherwise identical compilations." >> >> "Stable node dumps are useful for debugging (e.g. when diffing >> compiler traces from two different runs), so the solution is probably >> dumping interface sets in some order (e.g. lexicographic order of each >> interface dump) that does not depend on raw pointer values." >> >> This patch implements Roberto's recommendation by sorting interfaces >> on their ciBaseObject::_ident. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Revert "8297347: Problem list compiler/debug/TestStress*.java" > > This reverts commit 0b04a99245795c223a01d1cbe66a46d20e480c53. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
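(To illustrate the root cause quoted above - ordering that depends on object identity versus ordering on a stable id - here is a tiny stand-alone Java sketch. The Iface record and its ident field are invented for the example and only mirror the idea of sorting on ciBaseObject::_ident rather than on raw pointer values.)

import java.util.Arrays;
import java.util.Comparator;

// Ordering by identity (approximated here with identityHashCode) can differ
// between runs; ordering by a stable per-object id does not.
public class StableDumpOrder {
    record Iface(int ident, String name) { }

    public static void main(String[] args) {
        Iface[] set = { new Iface(3, "I3"), new Iface(1, "I1"), new Iface(2, "I2") };

        Iface[] unstable = set.clone();
        Arrays.sort(unstable, Comparator.comparingInt(System::identityHashCode));

        Iface[] stable = set.clone();
        Arrays.sort(stable, Comparator.comparingInt(Iface::ident));

        System.out.println("identity order (may vary per run): " + Arrays.toString(unstable));
        System.out.println("ident order    (always the same) : " + Arrays.toString(stable));
    }
}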
PR: https://git.openjdk.org/jdk/pull/11357 From thartmann at openjdk.org Mon Nov 28 07:44:28 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 28 Nov 2022 07:44:28 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" [v2] In-Reply-To: <6gcjBNNHk8twSr92oF2TGB90s6F4WnFZWT4xJPmuYoc=.c9c47aad-82a6-44a3-bc50-68e2b9e0a7c6@github.com> References: <6gcjBNNHk8twSr92oF2TGB90s6F4WnFZWT4xJPmuYoc=.c9c47aad-82a6-44a3-bc50-68e2b9e0a7c6@github.com> Message-ID: On Wed, 23 Nov 2022 19:58:41 GMT, Smita Kamath wrote: >> 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments Looks reasonable to me. src/hotspot/share/runtime/sharedRuntime.cpp line 472: > 470: } > 471: > 472: jint exp = ((0x7f800000 & doppel) >> (24 -1)) - 127; Suggestion: jint exp = ((0x7f800000 & doppel) >> (24 - 1)) - 127; ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11301 From rrich at openjdk.org Mon Nov 28 08:12:50 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 28 Nov 2022 08:12:50 GMT Subject: Integrated: 8286302: Port JEP 425 to PPC64 In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 21:54:26 GMT, Richard Reingruber wrote: > Hi, > > this is the port of [JEP 425: Virtual Threads (Preview)](https://openjdk.org/jeps/425)) to PPC64. > More precisely it is the port of vm continuations in hotspot to PPC64. It allows to run with `-XX:+VMContinuations` which is a prerequisit for 'real' virtual threads (oxymoron?). > > Most of the shared code changes are related to a new platform dependent constant `frame::metadata_words_at_top`. It is either added or subtracted to a frame address or size. > > The following is supposed to explain (without real life details) why it is needed in addition to the existing `frame::metadata_words`. The diagram shows a frame at `SP` and its stack arguments. The caller frame is located at `callers_SP`. > > > X86 / AARCH64 PPC64: > > : : : : > : : : : > | | | | > |-----------------| |-----------------| > | | | | > | stack arguments | | stack arguments | > | |<- callers_SP | | > =================== |-----------------| > | | | | > | metadata at bottom | | metadata at top | > | | | |<- callers_SP > |-----------------| =================== > | | | | > | | | | > | | | | > | | | | > | |<- SP | | > =================== |-----------------| > | | > | metadata at top | > | |<- SP > =================== > > > On X86 and AARCH64 metadata (that's return address, saved SP/FP, etc.) is stored at the frame bottom (`metadata at bottom`). On PPC64 it is stored at the frame top (`metadata at top`) where it affects size and address calculations. Shared code deals with this by making use of the platform dependent constant `frame::metadata_words_at_top`. > > * size required to 'freeze' a single frame with its stack arguments in a `StackChunk`: > `sizeof(frame) + sizeof(stack arguments) + frame::metadata_words_at_top` > > * address of stack arguments: > `callers_SP + frame::metadata_words_at_top` > > * The value of `frame::metadata_words_at_top` is 0 words on X86 and AARCH64. On PPC64 it is 4 words. > > Please refer to comments I've added for more details (e.g. the comment on StackChunkFrameStream::frame_size()). Recently I've given a talk about vm continuations and the PPC64 port to my colleagues. 
It's rather an overview than a deep dive. [The slides](http://cr.openjdk.java.net/~rrich/webrevs/2022/8286302/202210_loom_ppc64_port.pdf) might serve as well as an intro to the matter. > > The pr includes the new test jdk/jdk/internal/vm/Continuation/BasicExp.java which I wrote while doing the port. The test cases vary from simple to not-so-simple. One of the main features is that a CompilationPolicy passed as argument controls which frames are compiled/interpreted when freezing by defining a sliding window of compiled / interpreted frames which produces interesting transitions with and without stack arguments. There is overlap with Basic.java and Fuzz.java. Let me know wether to remove or keep BasicExp.java. Runtime with fastdebug: 2m on Apple M1 Pro and 3m on Intel Xeon E5-2660 v3 @ 2.60GHz. Note that -XX:+VerifyContinuations is explicitly set as a found it very useful, it increases the runtime quite a bit though. > > Testing: the change passed our CI testing: most JCK and JTREG tests, also in Xcomp mode, SPECjvm2008, SPECjbb2015, Renaissance Suite and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. These tests do include hotspot_loom and jdk_loom JTREG tests which I've also run with TEST_VM_OPTS="-XX:+VerifyContinuations" on X86_64, PPC64, and AARCH64. > > Thanks, Richard. This pull request has now been integrated. Changeset: 43d11736 Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/43d1173605128126dda0dc39ffc376b84065cc65 Stats: 3564 lines in 66 files changed: 3156 ins; 109 del; 299 mod 8286302: Port JEP 425 to PPC64 Reviewed-by: tsteele, mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/10961 From fyang at openjdk.org Mon Nov 28 09:39:11 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Nov 2022 09:39:11 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 10:43:30 GMT, Roman Kennke wrote: >> Hi Roman, >> >> Sorry for the late response - It is the former. >> >>> Are you saying that in init_buffer(), we don't have any method_entry_barrier stubs, yet, and therefore we return 0 there? >> >> This one. That `output()->_stub_list._stubs` appears to me to be always zero, for nodes are not emitted yet. I confirmed that before this patch `_safepoint_poll_table` is `0` but `_entry_barrier_table` is a non-zero value, like on aarch64 is `24`. Why it can work before is I think it is within some margins of error. Like `code_req += MAX_inst_size; // ensure per-instruction margin`, RISC-V's generated code is more verbose, so reproduces this. Simply adding a `4`, which is just one instruction size, to the new `stub_req` can make the build pass. >> >> But the zero value of `_stub_list._stubs` is not an expectant one, though - I am not quite sure the best way to fix that. The table sizes are in fact both 1 to me? >> >> Best, >> Xiaolin > >> Hi Roman, >> >> Sorry for the late response - It is the former. >> >> > Are you saying that in init_buffer(), we don't have any method_entry_barrier stubs, yet, and therefore we return 0 there? >> >> This one. That `output()->_stub_list._stubs` appears to me to be always zero, for nodes are not emitted yet. I confirmed that before this patch `_safepoint_poll_table` is `0` but `_entry_barrier_table` is a non-zero value, like on aarch64 is `24`. Why it can work before is I think it is within some margins of error. 
Like `code_req += MAX_inst_size; // ensure per-instruction margin`, RISC-V's generated code is more verbose, so reproduces this. Simply adding a `4`, which is just one instruction size, to the new `stub_req` can make the build pass. >> >> But the zero value of `_stub_list._stubs` is not an expectant one, though - I am not quite sure the best way to fix that. The table sizes are in fact both 1 to me? > > nmethod-barriers are not expected to be always present. It depends on the GC being active, if GC needs it, a (single) nmethod barrier is inserted. This is currently only the case for ZGC and Shenandoah. For all other GCs, nmethod barriers are not used, and thus no stubs should be emitted. @rkennke @zhengxiaolinX : Please ignore my previous message. I think change increases the size of the safepointpoll stub too much and thus breaks the condition at [1]. So false alarm. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/output.cpp#L233 ------------- PR: https://git.openjdk.org/jdk/pull/11188 From xlinzheng at openjdk.org Mon Nov 28 09:53:08 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Mon, 28 Nov 2022 09:53:08 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 15:03:07 GMT, Roman Kennke wrote: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Hi Roman, I felt there was something still vague to me, so I took another look into this issue earlier today and found another interesting thing. It seems there are two issues reflected by this PR, but of course, this PR is only doing refactoring work... awesome. The other issue is, it appears to me that [1] and [2] both lack a `cb->stubs()->maybe_expand_to_ensure_remaining();` before the `align()`s. After adding the expansion logic before the two places, failures are gone (on RISC-V). So, in summary, there are two issues here (certainly, not related to this PR - this PR just interestingly triggers and lets us spot them): 1. `output()->_stub_list._stubs` is always a `0` value. 2. the missing `cb->stubs()->maybe_expand_to_ensure_remaining();` before `align()` in the shared trampoline logic, as above-mentioned. It appears to me that we already have got the expansion logic for the two stubs [3], and the size is `2048` - enough big value to cover the sizes of the stubs. I would like to humbly suggest some solutions to it: 1. A quick fix is to remove the `C2CodeStubList::measure_code_size()` for it always returns a `0` now (sorry for saying this), or I guess we can use some other approaches to calculate the correct node counts of the two kinds of stubs. 2. I guess I might need to file another PR to solve the missing expansion logic in shared trampolines. I would like to hear what you think. 
Best, Xiaolin [1] https://github.com/openjdk/jdk/blob/43d1173605128126dda0dc39ffc376b84065cc65/src/hotspot/cpu/aarch64/codeBuffer_aarch64.cpp#L55 [2] https://github.com/openjdk/jdk/blob/43d1173605128126dda0dc39ffc376b84065cc65/src/hotspot/cpu/riscv/codeBuffer_riscv.cpp#L56 [3] https://github.com/openjdk/jdk/pull/11188/files#diff-96c31ff7167c1300458cf557427ee89af5250035ecbc2f189817c793a328a502R74 ------------- PR: https://git.openjdk.org/jdk/pull/11188 From adinn at openjdk.org Mon Nov 28 10:02:12 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 28 Nov 2022 10:02:12 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 02:31:25 GMT, Hao Sun wrote: > x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch. > > Note-1: minor style issues are fixed for CmpL3 related rules. > > Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case. > > Testing: tier1~3 passed on Linux/AArch64 platform with no new failures. > > Following is the performance data for the JMH case: > > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units > Integers.compareUnsignedDirect 500 avgt 5 0.994 ? 0.001 0.872 ? 0.015 us/op > Integers.compareUnsignedIndirect 500 avgt 5 0.991 ? 0.001 0.833 ? 0.055 us/op > Longs.compareUnsignedDirect 500 avgt 5 1.052 ? 0.001 0.974 ? 0.057 us/op > Longs.compareUnsignedIndirect 500 avgt 5 1.053 ? 0.001 0.916 ? 0.038 us/op Looks good. ------------- Marked as reviewed by adinn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11383 From aph at openjdk.org Mon Nov 28 10:43:24 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 28 Nov 2022 10:43:24 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 02:31:25 GMT, Hao Sun wrote: > x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch. > > Note-1: minor style issues are fixed for CmpL3 related rules. > > Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case. > > Testing: tier1~3 passed on Linux/AArch64 platform with no new failures. > > Following is the performance data for the JMH case: > > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units > Integers.compareUnsignedDirect 500 avgt 5 0.994 ? 0.001 0.872 ? 0.015 us/op > Integers.compareUnsignedIndirect 500 avgt 5 0.991 ? 0.001 0.833 ? 0.055 us/op > Longs.compareUnsignedDirect 500 avgt 5 1.052 ? 0.001 0.974 ? 0.057 us/op > Longs.compareUnsignedIndirect 500 avgt 5 1.053 ? 0.001 0.916 ? 0.038 us/op src/hotspot/cpu/aarch64/aarch64.ad line 9734: > 9732: } else { > 9733: __ subs(zr, $src1$$Register, con); > 9734: } Suggestion: __ subs(zr, $src1$$Register, (int32_t)$src2$$constant); should work here. 
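(For context, the semantics the new rules have to implement is a three-way unsigned comparison. The small stand-alone Java check below uses the XOR-with-MIN_VALUE trick as one possible branch-free scalar reference formulation; the class and method names are made up for the example, and this is not the code the matcher emits.)

// Checks that flipping the sign bit turns an unsigned three-way comparison
// into a signed one, which is what compareUnsigned has to compute.
public class CompareUnsignedCheck {
    static int viaXor(int a, int b) {
        return Integer.compare(a ^ Integer.MIN_VALUE, b ^ Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        int[] vals = { 0, 1, -1, 7, Integer.MIN_VALUE, Integer.MAX_VALUE, 0x8000_0001 };
        for (int a : vals) {
            for (int b : vals) {
                int expected = Integer.compareUnsigned(a, b);
                if (Integer.signum(expected) != Integer.signum(viaXor(a, b))) {
                    throw new AssertionError(a + " vs " + b);
                }
            }
        }
        System.out.println("unsigned three-way compare agrees on all sample pairs");
    }
}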
------------- PR: https://git.openjdk.org/jdk/pull/11383 From aph at openjdk.org Mon Nov 28 10:51:07 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 28 Nov 2022 10:51:07 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: <1EvcctFXNg8dJJBT0CTVPHJD-iOvEUTHJlHz8GKo7A0=.13dc63bf-6e42-40c9-8d2e-90f736cea88e@github.com> On Mon, 28 Nov 2022 02:31:25 GMT, Hao Sun wrote: > x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch. > > Note-1: minor style issues are fixed for CmpL3 related rules. > > Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case. > > Testing: tier1~3 passed on Linux/AArch64 platform with no new failures. > > Following is the performance data for the JMH case: > > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units > Integers.compareUnsignedDirect 500 avgt 5 0.994 ? 0.001 0.872 ? 0.015 us/op > Integers.compareUnsignedIndirect 500 avgt 5 0.991 ? 0.001 0.833 ? 0.055 us/op > Longs.compareUnsignedDirect 500 avgt 5 1.052 ? 0.001 0.974 ? 0.057 us/op > Longs.compareUnsignedIndirect 500 avgt 5 1.053 ? 0.001 0.916 ? 0.038 us/op It seems to me that the enhancement here, if it exists, is in the noise. That may well be because the test is dominated by memory traffic. ------------- PR: https://git.openjdk.org/jdk/pull/11383 From roland at openjdk.org Mon Nov 28 12:28:31 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 28 Nov 2022 12:28:31 GMT Subject: RFR: 8297343: TestStress*.java fail with "got different traces for the same seed" [v3] In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 06:14:18 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert "8297347: Problem list compiler/debug/TestStress*.java" >> >> This reverts commit 0b04a99245795c223a01d1cbe66a46d20e480c53. > > Looks good. @TobiHartmann @chhagedorn thanks for the reviews. ------------- PR: https://git.openjdk.org/jdk/pull/11357 From roland at openjdk.org Mon Nov 28 12:30:24 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 28 Nov 2022 12:30:24 GMT Subject: Integrated: 8297343: TestStress*.java fail with "got different traces for the same seed" In-Reply-To: References: Message-ID: <8RCtbyUeyxKGmiCOFaVHvwSyA5LKYerxy53lExKGANg=.69719d66-dc6f-47b2-8018-99705522b3dd@github.com> On Thu, 24 Nov 2022 15:27:41 GMT, Roland Westrelin wrote: > Root cause from Roberto's analysis: > > "The regression seems to be due to the introduction of non-determinism > in the node dumps of otherwise identical compilations." > > "The problem seems to be that JDK-6312651 dumps interface sets in an > order that is determined by the raw pointers of the set elements. This > is unstable across different runs and leads to different node dumps > for otherwise identical compilations." > > "Stable node dumps are useful for debugging (e.g. when diffing > compiler traces from two different runs), so the solution is probably > dumping interface sets in some order (e.g. lexicographic order of each > interface dump) that does not depend on raw pointer values." > > This patch implements Roberto's recommendation by sorting interfaces > on their ciBaseObject::_ident. This pull request has now been integrated. 
Changeset: eff4c039 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/eff4c039dab99aa946dbdde1be8901929ebbfc6f Stats: 14 lines in 2 files changed: 9 ins; 3 del; 2 mod 8297343: TestStress*.java fail with "got different traces for the same seed" Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/11357 From qamai at openjdk.org Mon Nov 28 12:43:04 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 28 Nov 2022 12:43:04 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v12] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. 
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: fix build failures, move limits to new header ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/afe37a99..4b359d3a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=10-11 Stats: 56 lines in 3 files changed: 25 ins; 30 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From roland at openjdk.org Mon Nov 28 14:19:46 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 28 Nov 2022 14:19:46 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node Message-ID: A main loop loses its pre loop. The Opaque1 node for the zero trip guard of the main loop is assigned control at a Region through which an If is split. As a result, the Opaque1 is cloned and the zero trip guard takes a Phi that merges Opaque1 nodes. One of the branch dies next and as, a result, the zero trip guard has an Opaque1 as input but at the wrong CmpI input. The assert fires next. The fix I propose is that if an Opaque1 node that is part of a zero trip guard is encountered during split if, rather than split if up or down, instead, assign it the control of the zero trip guard's control. This way the pattern of the zero trip guard is unaffected and split if can proceed. I believe it's safe to assign it a later control: - an Opaque1 can't be shared - the zero trip guard can't be the If that's being split As Vladimir noted, this bug used to not reproduce with loop strip mining disabled but now always reproduces because the loop strip mining nest is always constructed. The reason is that the main loop in this test is kept alive by the LSM safepoint. If the LSM loop nest is not constructed, the loop is optimized out. I filed: https://bugs.openjdk.org/browse/JDK-8297724 for this issue. ------------- Commit messages: - more - test - more - fix Changes: https://git.openjdk.org/jdk/pull/11391/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11391&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8269820 Stats: 169 lines in 6 files changed: 167 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11391.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11391/head:pull/11391 PR: https://git.openjdk.org/jdk/pull/11391 From rrich at openjdk.org Mon Nov 28 15:00:13 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 28 Nov 2022 15:00:13 GMT Subject: RFR: 8297487: G1 Remark: no need to keep alive oop constants of nmethods on stack In-Reply-To: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> References: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> Message-ID: On Wed, 23 Nov 2022 10:05:56 GMT, Richard Reingruber wrote: > This pr removes the stackwalks to keep alive oops of nmethods found on stack during G1 remark as it seems redundant. The oops are already kept alive by the [nmethod entry barrier](https://github.com/openjdk/jdk/blob/f26bd4e0e8b68de297a9ff93526cd7fac8668320/src/hotspot/share/gc/shared/barrierSetNMethod.cpp#L85) > > Additionally it fixes a comment that says nmethod entry barriers are needed to deal with continuations which, afaik, is not the case. Please correct me and explain if I'm mistaken. 
> > Testing: the patch is included in our daily CI testing since a week. That is most JCK and JTREG tests, also in Xcomp mode, Renaissance benchmark and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. There was no failure I could attribute to this change. > > I tried to find a jtreg test that is sensitive to the keep alive by omitting it in the nmethod entry barrier and also in G1 remark but without success. I would like to get rid of the stackwalks if they are unnecessary, for pause time reduction but also to ease understanding the code. I'm not in a hurry though. I've posted on the porters list to get some feedback from ARM32 maintainers: https://mail.openjdk.org/pipermail/porters-dev/2022-November/000739.html ------------- PR: https://git.openjdk.org/jdk/pull/11314 From aph at openjdk.org Mon Nov 28 17:44:45 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 28 Nov 2022 17:44:45 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 02:31:25 GMT, Hao Sun wrote: > x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch. > > Note-1: minor style issues are fixed for CmpL3 related rules. > > Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case. > > Testing: tier1~3 passed on Linux/AArch64 platform with no new failures. > > Following is the performance data for the JMH case: > > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units > Integers.compareUnsignedDirect 500 avgt 5 0.994 ? 0.001 0.872 ? 0.015 us/op > Integers.compareUnsignedIndirect 500 avgt 5 0.991 ? 0.001 0.833 ? 0.055 us/op > Longs.compareUnsignedDirect 500 avgt 5 1.052 ? 0.001 0.974 ? 0.057 us/op > Longs.compareUnsignedIndirect 500 avgt 5 1.053 ? 0.001 0.916 ? 0.038 us/op I've been trying for an hour or so to write a benchmark that is actually improved by this patch. I have been unable to do so. Part of the problem is that using `cmp; cset; cneg` doesn't take advantage of branch prediction. And more often than not, the result of a comparison is predictable. Also, the code isn't necessarily shorter either. I do not believe this patch should be committed unless someone manages to write a benchmark that demonstrates some advantage. ------------- PR: https://git.openjdk.org/jdk/pull/11383 From sviswanathan at openjdk.org Mon Nov 28 18:18:37 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 28 Nov 2022 18:18:37 GMT Subject: RFR: 8292761: x86: Clone nodes to match complex rules [v2] In-Reply-To: References: Message-ID: On Sat, 17 Sep 2022 12:23:35 GMT, Quan Anh Mai wrote: >> Please include the benchmark in the patch. Could you show the generated code before/after? Thanks! > > Thank @TobiHartmann @chhagedorn for your comments, I have updated the PR to address those. @merykitty Do you plan to get this optimization in? I had only one comment (October 18). Please take a look. ------------- PR: https://git.openjdk.org/jdk/pull/9977 From duke at openjdk.org Mon Nov 28 18:21:15 2022 From: duke at openjdk.org (Matthijs Bijman) Date: Mon, 28 Nov 2022 18:21:15 GMT Subject: RFR: 8293294: Remove dead code in Parse::check_interpreter_type Message-ID: A small cleanup in Parse::check_interpreter_type to remove two dead declarations. 
------------- Commit messages: - 8293294: Remove dead code in Parse::check_interpreter_type Changes: https://git.openjdk.org/jdk/pull/11325/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11325&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8293294 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11325.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11325/head:pull/11325 PR: https://git.openjdk.org/jdk/pull/11325 From vlivanov at openjdk.org Mon Nov 28 18:36:17 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 28 Nov 2022 18:36:17 GMT Subject: RFR: 8293294: Remove dead code in Parse::check_interpreter_type In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 14:38:35 GMT, Matthijs Bijman wrote: > A small cleanup in Parse::check_interpreter_type to remove two dead declarations. Looks good and trivial. ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.org/jdk/pull/11325 From kvn at openjdk.org Mon Nov 28 18:53:00 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 28 Nov 2022 18:53:00 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 14:02:50 GMT, Roland Westrelin wrote: > A main loop loses its pre loop. The Opaque1 node for the zero trip > guard of the main loop is assigned control at a Region through which > an If is split. As a result, the Opaque1 is cloned and the zero trip > guard takes a Phi that merges Opaque1 nodes. One of the branch dies > next and as, a result, the zero trip guard has an Opaque1 as input but > at the wrong CmpI input. The assert fires next. > > The fix I propose is that if an Opaque1 node that is part of a zero > trip guard is encountered during split if, rather than split if up or > down, instead, assign it the control of the zero trip guard's > control. This way the pattern of the zero trip guard is unaffected and > split if can proceed. I believe it's safe to assign it a later > control: > > - an Opaque1 can't be shared > > - the zero trip guard can't be the If that's being split > > As Vladimir noted, this bug used to not reproduce with loop strip > mining disabled but now always reproduces because the loop > strip mining nest is always constructed. The reason is that the > main loop in this test is kept alive by the LSM safepoint. If the > LSM loop nest is not constructed, the loop is optimized out. I > filed: > > https://bugs.openjdk.org/browse/JDK-8297724 > > for this issue. General question. Will it help (simplify changes) if we add specialized `class OpaqueZeroTripGuardNode : public Opaque1Node` class? src/hotspot/share/opto/split_if.cpp line 242: > 240: set_ctrl(n, ctrl->in(0)->in(0)); > 241: set_ctrl(cmp, ctrl->in(0)->in(0)); > 242: set_ctrl(bol, ctrl->in(0)->in(0)); Why you assign control to `cmp` and `bol` too? 
------------- PR: https://git.openjdk.org/jdk/pull/11391 From svkamath at openjdk.org Mon Nov 28 18:57:35 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Mon, 28 Nov 2022 18:57:35 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" [v3] In-Reply-To: References: Message-ID: > 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Addressed review comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11301/files - new: https://git.openjdk.org/jdk/pull/11301/files/5af25e9b..e432bf7c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11301&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11301&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11301.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11301/head:pull/11301 PR: https://git.openjdk.org/jdk/pull/11301 From sviswanathan at openjdk.org Mon Nov 28 19:11:55 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 28 Nov 2022 19:11:55 GMT Subject: RFR: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 22:05:34 GMT, Smita Kamath wrote: >> 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" > > Hi All, > > I have updated f2hf and hf2f methods in sharedRuntime.cpp as a fix for the error unexpected result of converting. Kindly review this patch and provide feedback. Thank you. > > Regards, > Smita The whitespace related change looks good. @smita-kamath Please go ahead and integrate. ------------- PR: https://git.openjdk.org/jdk/pull/11301 From svkamath at openjdk.org Mon Nov 28 19:28:37 2022 From: svkamath at openjdk.org (Smita Kamath) Date: Mon, 28 Nov 2022 19:28:37 GMT Subject: Integrated: 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 21:52:59 GMT, Smita Kamath wrote: > 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" This pull request has now been integrated. Changeset: 105d9d75 Author: Smita Kamath Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/105d9d75e84a46400f52fafda2ea00c99c14eaf0 Stats: 20 lines in 2 files changed: 12 ins; 1 del; 7 mod 8295351: java/lang/Float/Binary16Conversion.java fails with "Unexpected result of converting" Reviewed-by: sviswanathan, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/11301 From fgao at openjdk.org Tue Nov 29 02:32:47 2022 From: fgao at openjdk.org (Fei Gao) Date: Tue, 29 Nov 2022 02:32:47 GMT Subject: RFR: 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` Message-ID: <3dmE8J0CDjMIZSJyayrid5vRkD48AD9g6zaXr0M4mWo=.9c9a050c-7406-4a7f-a3c0-98aeb80d7590@github.com> Background: Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while Vector API[2] for them returns long type. 
Currently, to support auto-vectorization of Java API and Vector API at the same time, some vector platforms, namely aarch64 and x86, provides two types of vector nodes taking long type: One produces long vector type for vector API, and the other one produces int vector type by casting long-type result from the first one. We can move the casting work for auto-vectorization of Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below. 1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` In the patch, during the stage of generating vector node for the candidate pack, to implement the complete behavior of these Java APIs, superword will make two consecutive vector nodes: the first one, the same as Vector API, does the real execution to produce long-type result, and the second one casts the result to int vector type. For those platforms, which have supported correctly vectorizing these java APIs before, the patch has no real impact on final generated assembly code and, consequently, has no performance regression. 2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platform These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. It results in that these APIs can't be vectorized when `-XX:MaxVectorSize=16`. So, the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` would fail. To fix the issue of alignment, the patch corrects their related alignment, just like it did for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable. 3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` in aarch64 platforms with more than 128 bits Although `Long.NumberOfLeadingZeros/NumberOfTrailingZeros()` can be vectorized on sve platforms when `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, aarch64 backend didn't provide special vector implementation for Java API and thus the generated code is not correct, like: LOOP: sxtw x13, w12 add x14, x15, x13, uxtx #3 add x17, x14, #0x10 ld1d {z16.d}, p7/z, [x17] // Incorrectly use integer rbit/clz insn for long type vector *rbit z16.s, p7/m, z16.s *clz z16.s, p7/m, z16.s add x13, x16, x13, uxtx #2 str q16, [x13, #16] ... add w12, w12, #0x20 cmp w12, w3 b.lt LOOP It causes a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. After the refactoring, the testcase can pass and the code is corrected: LOOP: sxtw x13, w12 add x14, x15, x13, uxtx #3 add x17, x14, #0x10 ld1d {z16.d}, p7/z, [x17] // Compute with long vector type and convert to int vector type *rbit z16.d, p7/m, z16.d *clz z16.d, p7/m, z16.d *mov z24.d, #0 *uzp1 z25.s, z16.s, z24.s add x13, x16, x13, uxtx #2 str q25, [x13, #16] ... add w12, w12, #0x20 cmp w12, w3 b.lt LOOP 4. 
Fix an assertion failure on x86 avx2 platform Before, on x86 avx2 platform, there is an assertion failure when C2 tries to vectorize the loops like: // long[] ia; // int[] ic; for (int i = 0; i < LENGTH; ++i) { ic[i] = Long.numberOfLeadingZeros(ia[i]); } X86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platform, but it uses `evpmovqd()` to do casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure can be fixed naturally. Tier 1~3 passed with no new failures on Linux AArch64/X86 platform. [1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long) https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long) https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long) [2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687 [3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418 [4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239 ------------- Commit messages: - 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` Changes: https://git.openjdk.org/jdk/pull/11405/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11405&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297172 Stats: 303 lines in 11 files changed: 161 ins; 131 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/11405.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11405/head:pull/11405 PR: https://git.openjdk.org/jdk/pull/11405 From dzhang at openjdk.org Tue Nov 29 03:15:37 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 29 Nov 2022 03:15:37 GMT Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 01:34:15 GMT, Fei Yang wrote: >> The instruction which is matched `VectorLoadConst` will create index starting from 0 and incremented by 1. In detail, the instruction populates the destination vector by setting the first element to 0 and monotonically incrementing the value by 1 for each subsequent element. >> >> We can add support of `VectorLoadConst` for RISC-V by `vid.v` . It was implemented by referring to RVV v1.0 [1]. >> >> We can use the JMH test from https://github.com/openjdk/jdk/pull/10332. Tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. 
By adding the `-XX:+PrintAssembly`, the compilation log of `floatIndexVector` is as follows: >> >> >> 120 vloadcon V2 # generate iota indices >> 12c vfmul.vv V1, V2, V1 #@vmulF >> 134 vfmv.v.f V2, F8 #@replicateF >> 13c vfadd.vv V1, V2, V1 #@vaddF >> >> The above nodes match the logic of `Compute indexes with "vec + iota * scale"` in https://github.com/openjdk/jdk/pull/10332, which is the operation corresponding to `addIndex` in benchmark: >> https://github.com/openjdk/jdk/blob/d6102110e1b48c065292db83744245a33e269cc2/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java#L92-L97 >> >> At the same time, the following assembly code will be generated when running the `floatIndexVector` case, there will be one more instruction than `intIndexVector`: >> >> 0x000000401443cc9c: .4byte 0x10072d7 >> 0x000000401443cca0: .4byte 0x5208a157 >> 0x000000401443cca4: .4byte 0x4a219157 >> >> `0x10072d7/0x5208a1d7` is the machine code for `vsetvli/vid.v` and `0x4a219157` is the additional machine code for `vfcvt.f.x.v`, which are the opcodes generated by `is_floating_point_type(bt)`: >> >> if (is_floating_point_type(bt)) { >> __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); >> } >> >> >> After we implement these nodes, by using `-XX:+UseRVV`, the number of assembly instructions is reduced by about ~50% because of the different execution paths with the number of loops, similar to `AddTest` [3]. >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java >> [3] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md >> >> Please take a look and have some reviews. Thanks a lot. >> >> ## Testing: >> >> - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) >> - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) > > src/hotspot/cpu/riscv/riscv_v.ad line 2091: > >> 2089: __ vid_v(as_VectorRegister($dst$$reg)); >> 2090: if (is_floating_point_type(bt)) { >> 2091: __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); > > You might want to distinugish between float and double for 'bt' here. Since vfcvt.f.x.v only convert signed integer to float. Hi @RealFYang , thanks for the review! Since `vid.v` generates a sequence of integers, it should be converted and stored in each element of the vector if `bt` is floating point type. ------------- PR: https://git.openjdk.org/jdk/pull/11344 From fyang at openjdk.org Tue Nov 29 03:42:14 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Nov 2022 03:42:14 GMT Subject: RFR: 8297715: RISC-V: C2: Use single-bit instructions from the Zbs extension Message-ID: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> The single-bit instructions from the Zbs extension provide a mechanism to set, clear, invert, or extract a single bit in a register. The bit is specified by its index. Especially, the single-bit extract (immediate) instruction 'bexti rd, rs1, shamt' performs: let index = shamt & (XLEN - 1); X(rd) = (X(rs1) >> index) & 1; This instruction is a perfect match for following sub-graph in C2 when integer immediate 'mask' is power of 2: ''' Set dst (Conv2B (AndI src mask)) ''' Then we could optimize C2 JIT code for [1]: Before: lhu R28, [R11, #12] # short, #@loadUS ! 
Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags andi R7, R28, #8 #@andI_reg_imm snez R10, R7 #@convI2Bool After: lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags bexti R10, R28, 3 # Testing: Tier1-3 hotspot & jdk tested with QEMU (JTREG="VM_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+UseZbs"). [1] https://github.com/openjdk/jdk/blob/master/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/dom/NodeImpl.java#L1936 ------------- Commit messages: - 8297715: RISC-V: C2: Use single-bit instructions from the Zbs extension Changes: https://git.openjdk.org/jdk/pull/11406/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11406&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297715 Stats: 26 lines in 4 files changed: 24 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11406.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11406/head:pull/11406 PR: https://git.openjdk.org/jdk/pull/11406 From fyang at openjdk.org Tue Nov 29 06:46:22 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Nov 2022 06:46:22 GMT Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 05:40:12 GMT, Dingli Zhang wrote: > The instruction which is matched `VectorLoadConst` will create index starting from 0 and incremented by 1. In detail, the instruction populates the destination vector by setting the first element to 0 and monotonically incrementing the value by 1 for each subsequent element. > > We can add support of `VectorLoadConst` for RISC-V by `vid.v` . It was implemented by referring to RVV v1.0 [1]. > > We can use the JMH test from https://github.com/openjdk/jdk/pull/10332. Tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. By adding the `-XX:+PrintAssembly`, the compilation log of `floatIndexVector` is as follows: > > > 120 vloadcon V2 # generate iota indices > 12c vfmul.vv V1, V2, V1 #@vmulF > 134 vfmv.v.f V2, F8 #@replicateF > 13c vfadd.vv V1, V2, V1 #@vaddF > > The above nodes match the logic of `Compute indexes with "vec + iota * scale"` in https://github.com/openjdk/jdk/pull/10332, which is the operation corresponding to `addIndex` in benchmark: > https://github.com/openjdk/jdk/blob/d6102110e1b48c065292db83744245a33e269cc2/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java#L92-L97 > > At the same time, the following assembly code will be generated when running the `floatIndexVector` case, there will be one more instruction than `intIndexVector`: > > 0x000000401443cc9c: .4byte 0x10072d7 > 0x000000401443cca0: .4byte 0x5208a157 > 0x000000401443cca4: .4byte 0x4a219157 > > `0x10072d7/0x5208a1d7` is the machine code for `vsetvli/vid.v` and `0x4a219157` is the additional machine code for `vfcvt.f.x.v`, which are the opcodes generated by `is_floating_point_type(bt)`: > > if (is_floating_point_type(bt)) { > __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); > } > > > After we implement these nodes, by using `-XX:+UseRVV`, the number of assembly instructions is reduced by about ~50% because of the different execution paths with the number of loops, similar to `AddTest` [3]. 
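As a rough sketch of the kind of Vector API kernel exercised here (hypothetical, loosely modelled on the floatIndexVector case of the benchmark referenced above; the method name and parameters are illustrative):

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorSpecies;

    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // res[i] = start + i * scale; addIndex(scale) produces the iota vector
    // that VectorLoadConst (vid.v on RVV) materialises.
    static void floatIndex(float[] res, float start, int scale) {
        for (int i = 0; i < SPECIES.loopBound(res.length); i += SPECIES.length()) {
            FloatVector.broadcast(SPECIES, start + (float) i * scale)
                       .addIndex(scale)
                       .intoArray(res, i);
        }
    }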
> > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java > [3] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > > - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) > - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) Marked as reviewed by fyang (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11344 From fyang at openjdk.org Tue Nov 29 06:46:23 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Nov 2022 06:46:23 GMT Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 03:10:17 GMT, Dingli Zhang wrote: >> src/hotspot/cpu/riscv/riscv_v.ad line 2091: >> >>> 2089: __ vid_v(as_VectorRegister($dst$$reg)); >>> 2090: if (is_floating_point_type(bt)) { >>> 2091: __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); >> >> You might want to distinugish between float and double for 'bt' here. Since vfcvt.f.x.v only convert signed integer to float. > > Hi @RealFYang , thanks for the review! Since `vid.v` generates a sequence of integers, it should be converted and stored in each element of the vector if `bt` is floating point type. Hi, I double checked the RVV spec and I think you are right here. Change looks good then. ------------- PR: https://git.openjdk.org/jdk/pull/11344 From fjiang at openjdk.org Tue Nov 29 06:52:04 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 29 Nov 2022 06:52:04 GMT Subject: RFR: 8297715: RISC-V: C2: Use single-bit instructions from the Zbs extension In-Reply-To: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> References: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> Message-ID: On Tue, 29 Nov 2022 03:35:12 GMT, Fei Yang wrote: > The single-bit instructions from the Zbs extension provide a mechanism to set, clear, > invert, or extract a single bit in a register. The bit is specified by its index. > > Especially, the single-bit extract (immediate) instruction 'bexti rd, rs1, shamt' performs: > > let index = shamt & (XLEN - 1); > X(rd) = (X(rs1) >> index) & 1; > > > This instruction is a perfect match for following C2 sub-graph when integer immediate 'mask' is power of 2: > > Set dst (Conv2B (AndI src mask)) > > > The effect is that we could then optimize C2 JIT code for methods like [1]: > Before: > > lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > andi R7, R28, #8 #@andI_reg_imm > snez R10, R7 #@convI2Bool > > > After: > > lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > bexti R10, R28, 3 # > > > Testing: Tier1-3 hotspot & jdk tested with QEMU (JTREG="VM_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+UseZbs"). > > [1] https://github.com/openjdk/jdk/blob/master/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/dom/NodeImpl.java#L1936 Looks good. ------------- Marked as reviewed by fjiang (Author). 
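For reference, a minimal Java sketch of the source shape that lowers to the matched (Set dst (Conv2B (AndI src mask))) subgraph - illustrative only, not the NodeImpl code linked above:

    // A power-of-two mask test materialised as a boolean; with Zbs this can
    // become a single "bexti dst, src, 3" as in the example above.
    static boolean hasFlag(int flags) {
        return (flags & 0x8) != 0;   // mask 8 == bit index 3
    }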
PR: https://git.openjdk.org/jdk/pull/11406 From thartmann at openjdk.org Tue Nov 29 08:17:16 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 29 Nov 2022 08:17:16 GMT Subject: RFR: 8293294: Remove dead code in Parse::check_interpreter_type In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 14:38:35 GMT, Matthijs Bijman wrote: > A small cleanup in Parse::check_interpreter_type to remove two dead declarations. Marked as reviewed by thartmann (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11325 From aph at openjdk.org Tue Nov 29 09:03:19 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 29 Nov 2022 09:03:19 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 02:31:25 GMT, Hao Sun wrote: > x86 implemented the intrinsics for compareUnsigned() method in Integer and Long. See JDK-8283726. We add the corresponding AArch64 backend support in this patch. > > Note-1: minor style issues are fixed for CmpL3 related rules. > > Note-2: Jtreg case TestCompareUnsigned.java is updated to cover the matching rules for "comparing reg with imm" case. > > Testing: tier1~3 passed on Linux/AArch64 platform with no new failures. > > Following is the performance data for the JMH case: > > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units > Integers.compareUnsignedDirect 500 avgt 5 0.994 ? 0.001 0.872 ? 0.015 us/op > Integers.compareUnsignedIndirect 500 avgt 5 0.991 ? 0.001 0.833 ? 0.055 us/op > Longs.compareUnsignedDirect 500 avgt 5 1.052 ? 0.001 0.974 ? 0.057 us/op > Longs.compareUnsignedIndirect 500 avgt 5 1.053 ? 0.001 0.916 ? 0.038 us/op What would probably work better is to idealize `(cmp (cmp3 x y) < 0)` to `(cmpU x y)` ------------- PR: https://git.openjdk.org/jdk/pull/11383 From ngasson at openjdk.org Tue Nov 29 09:45:15 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Tue, 29 Nov 2022 09:45:15 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v5] In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 15:56:08 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. > > Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains six commits: > > - Resolve merge conflicts with master > - Merge branch 'master' into JDK-8293488 > - Removed svesha3 feature check for eor3 > - Changed the modifier order preference in JTREG test > - Modified JTREG test to include feature constraints > - 8293488: Add EOR3 backend rule for aarch64 SHA3 extension > > Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those > SHA3 instructions - "eor3" performs an exclusive OR of three vectors. > This is helpful in applications that have multiple, consecutive "eor" > operations which can be reduced by clubbing them into fewer operations > using the "eor3" instruction. For example - > eor a, a, b > eor a, a, c > can be optimized to single instruction - eor3 a, b, c > > This patch adds backend rules for Neon and SVE2 "eor3" instructions and > a micro benchmark to assess the performance gains with this patch. > Following are the results of the included micro benchmark on a 128-bit > aarch64 machine that supports Neon, SVE2 and SHA3 features - > > Benchmark gain > TestEor3.test1Int 10.87% > TestEor3.test1Long 8.84% > TestEor3.test2Int 21.68% > TestEor3.test2Long 21.04% > > The numbers shown are performance gains with using Neon eor3 instruction > over the master branch that uses multiple "eor" instructions instead. > Similar gains can be observed with the SVE2 "eor3" version as well since > the "eor3" instruction is unpredicated and the machine under test uses a > maximum vector width of 128 bits which makes the SVE2 code generation very > similar to the one with Neon. test/hotspot/gtest/aarch64/aarch64-asmtest.py line 1043: > 1041: [str(self.reg[i]) for i in range(1, self.numRegs)])) > 1042: def astr(self): > 1043: if self._name == "eor3": Suggestion: firstArg = 0 if self._name == "eor3" else 1 formatStr = "%s%s" + ''.join([", %s" for i in range(firstArg, self.numRegs)]) And similarly below. ------------- PR: https://git.openjdk.org/jdk/pull/10407 From yyang at openjdk.org Tue Nov 29 09:54:30 2022 From: yyang at openjdk.org (Yi Yang) Date: Tue, 29 Nov 2022 09:54:30 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: On Tue, 22 Nov 2022 17:05:04 GMT, Vladimir Kozlov wrote: > I may ask to do our internal performance testing for this change too before approval. Do you mean... oracle internal performance testing or opensource microbenchmark inside openjdk? For the former one, may?I?ask?your?help?to?do that? Thanks. (For the later one, I tested JDK-8273585 attached test, and it works well. After this patch, it takes 8s.) ------------- PR: https://git.openjdk.org/jdk/pull/9695 From haosun at openjdk.org Tue Nov 29 09:57:36 2022 From: haosun at openjdk.org (Hao Sun) Date: Tue, 29 Nov 2022 09:57:36 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 08:59:40 GMT, Andrew Haley wrote: > What would probably work better is to idealize `(cmp (cmp3 x y) < 0)` to `(cmpU x y)` I think this has been done in the original x86 patch. 
See https://github.com/openjdk/jdk/pull/9068/files#diff-054ecd9354722843f23556a38d2c24546c8a777b58b3442abea2d5e9fe6bb916R851 ------------- PR: https://git.openjdk.org/jdk/pull/11383 From aph at openjdk.org Tue Nov 29 10:04:25 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 29 Nov 2022 10:04:25 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 09:55:34 GMT, Hao Sun wrote: > > What would probably work better is to idealize `(cmp (cmp3 x y) < 0)` to `(cmpU x y)` > > I think this has been done in the original x86 patch. See https://github.com/openjdk/jdk/pull/9068/files#diff-054ecd9354722843f23556a38d2c24546c8a777b58b3442abea2d5e9fe6bb916R851 Interesting. I didn't see this happening. I'll have another look. ------------- PR: https://git.openjdk.org/jdk/pull/11383 From haosun at openjdk.org Tue Nov 29 10:04:27 2022 From: haosun at openjdk.org (Hao Sun) Date: Tue, 29 Nov 2022 10:04:27 GMT Subject: RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 17:42:24 GMT, Andrew Haley wrote: > I've been trying for an hour or so to write a benchmark that is actually improved by this patch. I have been unable to do so. Part of the problem is that using `cmp; cset; cneg` doesn't take advantage of branch prediction. And more often than not, the result of a comparison is predictable. Also, the code isn't necessarily shorter either. I do not believe this patch should be committed unless someone manages to write a benchmark that demonstrates some advantage. Thanks for your insightful comment. I'm doing some investigation. ------------- PR: https://git.openjdk.org/jdk/pull/11383 From aph-open at littlepinkcloud.com Tue Nov 29 10:20:44 2022 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Tue, 29 Nov 2022 10:20:44 +0000 Subject: C2, ThreadLocalNode, and Loom In-Reply-To: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com> References: <666c7f0a-eaf8-7ccb-7233-fb5fe20d0ac8@redhat.com> Message-ID: On 11/24/22 17:25, Aleksey Shipilev wrote: > 3) Some other easy way out I am overlooking? > > Any other ideas? Tag the frame info with a field which identifies it as containing a pointer to the thread. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From rkennke at openjdk.org Tue Nov 29 11:49:48 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 29 Nov 2022 11:49:48 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v2] In-Reply-To: References: Message-ID: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. 
> > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Don't measure code size in advance, but verify it after emitting ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/30a22232..ed2ab014 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=00-01 Stats: 75 lines in 4 files changed: 17 ins; 52 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From rkennke at openjdk.org Tue Nov 29 12:02:40 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 29 Nov 2022 12:02:40 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v3] In-Reply-To: References: Message-ID: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: AArch64 parts ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/ed2ab014..1b4541fb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=01-02 Stats: 8 lines in 1 file changed: 8 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From thartmann at openjdk.org Tue Nov 29 12:05:34 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 29 Nov 2022 12:05:34 GMT Subject: RFR: 8290432: C2 compilation fails with assert(node->_last_del == _last) failed: must have deleted the edge just produced [v6] In-Reply-To: References: Message-ID: <0cHqGwSwc-jPUC3HQkzyaAkD8I8IKCK7592tirF5mUw=.f8ba9fe7-d397-4fe9-9b41-9650830a0adc@github.com> On Mon, 21 Nov 2022 02:31:34 GMT, Yi Yang wrote: >> Hi, can I have a review for this patch? [JDK-8273585](https://bugs.openjdk.org/browse/JDK-8273585) recognized the form of `Phi->CastII->AddI` as additional parallel induction variables. 
In the following program: >> >> class Test { >> static int dontInline() { >> return 0; >> } >> >> static long test(int val, boolean b) { >> long ret = 0; >> long dArr[] = new long[100]; >> for (int i = 15; 293 > i; ++i) { >> ret = val; >> int j = 1; >> while (++j < 6) { >> int k = (val--); >> for (long l = i; 1 > l; ) { >> if (k != 0) { >> ret += dontInline(); >> } >> } >> if (b) { >> break; >> } >> } >> } >> return ret; >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 1000; i++) { >> test(0, false); >> } >> } >> } >> >> `val` is incorrectly matched with the new parallel IV form: >> ![image](https://user-images.githubusercontent.com/5010047/182059398-fc5204bc-8d95-4e3e-8c66-15776af457b8.png) >> And C2 further replaces it with newly added nodes, which finally leads the crash: >> ![image](https://user-images.githubusercontent.com/5010047/182059498-13148d46-b10f-4e18-b84a-f6b9f626ac7b.png) >> >> I think we can add more constraints to the new form. The form of `Phi->CastXX->AddX` appears when using Preconditions.checkIndex, and it would be recognized as additional IV when 1) Phi != phi2, 2) CastXX is controlled by RangeCheck(to reflect changes in Preconditions checkindex intrinsic) > > Yi Yang has updated the pull request incrementally with one additional commit since the last revision: > > whitespace I'll run some performance testing in our system and report back once it passed. ------------- PR: https://git.openjdk.org/jdk/pull/9695 From thartmann at openjdk.org Tue Nov 29 12:06:43 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 29 Nov 2022 12:06:43 GMT Subject: RFR: 8296924: C2: assert(is_valid_AArch64_address(dest.target())) failed: bad address Message-ID: With (unreachable) unsafe accesses, it can happen that the base address is invalid. On AArch64, C2 will emit a `loadConP` for loading the constant address that is implemented by [aarch64_enc_mov_p](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/aarch64.ad#L3366) calling [MacroAssembler::adrp](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L4576). The `adrp` implementation then asserts in [is_valid_AArch64_address](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp#L1321), assuming that we can only ever load constant pointers that are within the 48-bit AArch64 address space. The fix, proposed by @theRealAph, is to emit a full-blown `mov` in case of a bad address. Thanks, Tobias ------------- Commit messages: - 8296924: C2: assert(is_valid_AArch64_address(dest.target())) failed: bad address Changes: https://git.openjdk.org/jdk/pull/11412/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11412&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296924 Stats: 59 lines in 2 files changed: 58 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11412.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11412/head:pull/11412 PR: https://git.openjdk.org/jdk/pull/11412 From aph at openjdk.org Tue Nov 29 12:06:45 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 29 Nov 2022 12:06:45 GMT Subject: RFR: 8296924: C2: assert(is_valid_AArch64_address(dest.target())) failed: bad address In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 11:55:51 GMT, Tobias Hartmann wrote: > With (unreachable) unsafe accesses, it can happen that the base address is invalid. 
On AArch64, C2 will emit a `loadConP` for loading the constant address that is implemented by [aarch64_enc_mov_p](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/aarch64.ad#L3366) calling [MacroAssembler::adrp](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L4576). The `adrp` implementation then asserts in [is_valid_AArch64_address](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp#L1321), assuming that we can only ever load constant pointers that are within the 48-bit AArch64 address space. > > The fix, proposed by @theRealAph, is to emit a full-blown `mov` in case of a bad address. > > Thanks, > Tobias Thanks. That's obviously correct. src/hotspot/cpu/aarch64/aarch64.ad line 3380: > 3378: } else { > 3379: assert(rtype == relocInfo::none, "unexpected reloc type"); > 3380: if (!__ is_valid_AArch64_address(con) || Suggestion: if (! __ is_valid_AArch64_address(con) || ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.org/jdk/pull/11412 From rkennke at openjdk.org Tue Nov 29 12:10:27 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 29 Nov 2022 12:10:27 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 09:50:54 GMT, Xiaolin Zheng wrote: > Hi Roman, > > I felt there was something still vague to me, so I took another look into this issue earlier today and found another interesting thing. > > It seems there are two issues reflected by this PR, but of course, this PR is only doing refactoring work... awesome. > > The other issue is, it appears to me that [1] and [2] both lack a `cb->stubs()->maybe_expand_to_ensure_remaining();` before the `align()`s. After adding the expansion logic before the two places, failures are gone (on RISC-V). > > So, in summary, there are two issues here (certainly, not related to this PR - this PR just interestingly triggers and lets us spot them): > > 1. `output()->_stub_list._stubs` is always a `0` value. > 2. the missing `cb->stubs()->maybe_expand_to_ensure_remaining();` before `align()` in the shared trampoline logic, as above-mentioned. > > It appears to me that we already have got the expansion logic for the two stubs [3], and the size is `2048` - enough big value to cover the sizes of the stubs. > > I would like to humbly suggest some solutions to it: > > 1. A quick fix is to remove the `C2CodeStubList::measure_code_size()` for it always returns a `0` now (sorry for saying this), or I guess we can use some other approaches to calculate the correct node counts of the two kinds of stubs. > 2. I guess I might need to file another PR to solve the missing expansion logic in shared trampolines. > > I would like to hear what you think. > > Best, Xiaolin > > [1] > > https://github.com/openjdk/jdk/blob/43d1173605128126dda0dc39ffc376b84065cc65/src/hotspot/cpu/aarch64/codeBuffer_aarch64.cpp#L55 > > > [2] > https://github.com/openjdk/jdk/blob/43d1173605128126dda0dc39ffc376b84065cc65/src/hotspot/cpu/riscv/codeBuffer_riscv.cpp#L56 > > > [3] https://github.com/openjdk/jdk/pull/11188/files#diff-96c31ff7167c1300458cf557427ee89af5250035ecbc2f189817c793a328a502R74 I think I understand now. You are right - when code size is 'measured' we don't have any stubs, yet. That is because the stubs only get generated while all the other assembly code is emitted, i.e. 
after code buffers are generated. This problem is pre-existing and C2SafepointPollStub got that part wrong before. However, we *do* call maybe_expand_to_ensure_remaining() before align(), that happens in C2CodeStubList::emit() before each stub gets emitted. I changed that part now to try expansion only with the amount of code that each stub requires instead of some maximum size. I'm also checking that each stub generates as much code as it reports it would. I am not sure how useful that is, tbh. But it helps to implement the size() methods (which you need to do now in RISCV). Start with implementing them to return 0, do a build, and change the value to what the check reports. The only other way to improve the situation is if we would first emit the whole method into the code buffer, and then measure and create a new buffer only for the stubs. It would be very small, and I don't know if it's worth the effort or if that is possible at all. WDYT? Please let me know if that fixes your problem! ------------- PR: https://git.openjdk.org/jdk/pull/11188 From thartmann at openjdk.org Tue Nov 29 12:15:32 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 29 Nov 2022 12:15:32 GMT Subject: RFR: 8296924: C2: assert(is_valid_AArch64_address(dest.target())) failed: bad address [v2] In-Reply-To: References: Message-ID: > With (unreachable) unsafe accesses, it can happen that the base address is invalid. On AArch64, C2 will emit a `loadConP` for loading the constant address that is implemented by [aarch64_enc_mov_p](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/aarch64.ad#L3366) calling [MacroAssembler::adrp](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L4576). The `adrp` implementation then asserts in [is_valid_AArch64_address](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp#L1321), assuming that we can only ever load constant pointers that are within the 48-bit AArch64 address space. > > The fix, proposed by @theRealAph, is to emit a full-blown `mov` in case of a bad address. > > Thanks, > Tobias Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/cpu/aarch64/aarch64.ad Co-authored-by: Andrew Haley ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11412/files - new: https://git.openjdk.org/jdk/pull/11412/files/fa945fa3..453e6638 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11412&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11412&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11412.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11412/head:pull/11412 PR: https://git.openjdk.org/jdk/pull/11412 From thartmann at openjdk.org Tue Nov 29 12:15:33 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 29 Nov 2022 12:15:33 GMT Subject: RFR: 8296924: C2: assert(is_valid_AArch64_address(dest.target())) failed: bad address In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 11:55:51 GMT, Tobias Hartmann wrote: > With (unreachable) unsafe accesses, it can happen that the base address is invalid. 
On AArch64, C2 will emit a `loadConP` for loading the constant address that is implemented by [aarch64_enc_mov_p](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/aarch64.ad#L3366) calling [MacroAssembler::adrp](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L4576). The `adrp` implementation then asserts in [is_valid_AArch64_address](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp#L1321), assuming that we can only ever load constant pointers that are within the 48-bit AArch64 address space. > > The fix, proposed by @theRealAph, is to emit a full-blown `mov` in case of a bad address. > > Thanks, > Tobias Thanks for the review and the help with the fix, Andrew! ------------- PR: https://git.openjdk.org/jdk/pull/11412 From xlinzheng at openjdk.org Tue Nov 29 12:59:22 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Tue, 29 Nov 2022 12:59:22 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v3] In-Reply-To: References: Message-ID: <2zCmvVRzXernU_6akt6mUsjrUT3PsrALsqrj1EAtkqI=.a78fe5fd-3c7b-4808-8323-bc69e19baa2c@github.com> On Tue, 29 Nov 2022 12:02:40 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > AArch64 parts Thank you for the updates, Roman! I am working on the RISC-V part and testing the values now, and I think the current version is nice enough ;-). > However, we do call maybe_expand_to_ensure_remaining() before align(), that happens in C2CodeStubList::emit() before each stub gets emitted. But I think the `align()` is a different story - I guess there shall be another fix [1] independently; it appears to me that there is no such `align()` thing in `C2CodeStubList::emit()` :-) In my opinion, it is another issue if I didn't miss something. I am still natively compiling a RISC-V fastdebug build using QEMU - it is slow but already fast enough compared to natively building on a physical board - and may need about one hour for me to come back to you. [1] https://github.com/zhengxiaolinX/jdk/commit/0c031a78742d10c84a7cf2f3ec3a823351bb9876 ------------- PR: https://git.openjdk.org/jdk/pull/11188 From mdoerr at openjdk.org Tue Nov 29 13:14:01 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 29 Nov 2022 13:14:01 GMT Subject: RFR: 8295724: VirtualMachineError: Out of space in CodeCache for method handle intrinsic [v12] In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 17:00:42 GMT, Martin Doerr wrote: >> This proposal prevents the VM from terminating unexpectedly in some rare cases (see JBS issue). 
It allows using NonNMethod code space for method handle intrinsics which are needed urgently if the other code cache spaces are full. There are other options (see JBS issue), but this one appears to be the simplest one. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add regression test. Please note that other people have observed this problem, too. See [JDK-8297267](https://bugs.openjdk.org/browse/JDK-8297267). I think we should have a fix for JDK 20. Can we use this proposal or are there more concerns? I guess [JDK-8296336](https://bugs.openjdk.org/browse/JDK-8296336) won't make it into JDK 20. ------------- PR: https://git.openjdk.org/jdk/pull/10933 From bkilambi at openjdk.org Tue Nov 29 13:55:36 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 29 Nov 2022 13:55:36 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v6] In-Reply-To: References: Message-ID: > Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - > > eor a, a, b > eor a, a, c > > can be optimized to single instruction - `eor3 a, b, c` > > This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - > > > Benchmark gain > TestEor3.test1Int 10.87% > TestEor3.test1Long 8.84% > TestEor3.test2Int 21.68% > TestEor3.test2Long 21.04% > > > The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Improve assembler test generation for eor3 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10407/files - new: https://git.openjdk.org/jdk/pull/10407/files/a0aa8cdc..6265863a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10407&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10407&range=04-05 Stats: 8 lines in 1 file changed: 0 ins; 5 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10407.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10407/head:pull/10407 PR: https://git.openjdk.org/jdk/pull/10407 From bkilambi at openjdk.org Tue Nov 29 13:57:36 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 29 Nov 2022 13:57:36 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v5] In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 09:41:34 GMT, Nick Gasson wrote: >> Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains six commits: >> >> - Resolve merge conflicts with master >> - Merge branch 'master' into JDK-8293488 >> - Removed svesha3 feature check for eor3 >> - Changed the modifier order preference in JTREG test >> - Modified JTREG test to include feature constraints >> - 8293488: Add EOR3 backend rule for aarch64 SHA3 extension >> >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those >> SHA3 instructions - "eor3" performs an exclusive OR of three vectors. >> This is helpful in applications that have multiple, consecutive "eor" >> operations which can be reduced by clubbing them into fewer operations >> using the "eor3" instruction. For example - >> eor a, a, b >> eor a, a, c >> can be optimized to single instruction - eor3 a, b, c >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and >> a micro benchmark to assess the performance gains with this patch. >> Following are the results of the included micro benchmark on a 128-bit >> aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> The numbers shown are performance gains with using Neon eor3 instruction >> over the master branch that uses multiple "eor" instructions instead. >> Similar gains can be observed with the SVE2 "eor3" version as well since >> the "eor3" instruction is unpredicated and the machine under test uses a >> maximum vector width of 128 bits which makes the SVE2 code generation very >> similar to the one with Neon. > > test/hotspot/gtest/aarch64/aarch64-asmtest.py line 1043: > >> 1041: [str(self.reg[i]) for i in range(1, self.numRegs)])) >> 1042: def astr(self): >> 1043: if self._name == "eor3": > > Suggestion: > > firstArg = 0 if self._name == "eor3" else 1 > formatStr = "%s%s" + ''.join([", %s" for i in range(firstArg, self.numRegs)]) > > > And similarly below. Thank you for the suggestion. I made the suggested changes in the latest patch. Please review. ------------- PR: https://git.openjdk.org/jdk/pull/10407 From roland at openjdk.org Tue Nov 29 14:13:46 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 29 Nov 2022 14:13:46 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 14:02:50 GMT, Roland Westrelin wrote: > A main loop loses its pre loop. The Opaque1 node for the zero trip > guard of the main loop is assigned control at a Region through which > an If is split. As a result, the Opaque1 is cloned and the zero trip > guard takes a Phi that merges Opaque1 nodes. One of the branch dies > next and as, a result, the zero trip guard has an Opaque1 as input but > at the wrong CmpI input. The assert fires next. > > The fix I propose is that if an Opaque1 node that is part of a zero > trip guard is encountered during split if, rather than split if up or > down, instead, assign it the control of the zero trip guard's > control. This way the pattern of the zero trip guard is unaffected and > split if can proceed. I believe it's safe to assign it a later > control: > > - an Opaque1 can't be shared > > - the zero trip guard can't be the If that's being split > > As Vladimir noted, this bug used to not reproduce with loop strip > mining disabled but now always reproduces because the loop > strip mining nest is always constructed. The reason is that the > main loop in this test is kept alive by the LSM safepoint. 
If the > LSM loop nest is not constructed, the loop is optimized out. I > filed: > > https://bugs.openjdk.org/browse/JDK-8297724 > > for this issue. Thanks for reviewing this. > General question. Will it help (simplify changes) if we add specialized `class OpaqueZeroTripGuardNode : public Opaque1Node` class? Good suggestion! Let me give it a try. ------------- PR: https://git.openjdk.org/jdk/pull/11391 From roland at openjdk.org Tue Nov 29 14:13:49 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 29 Nov 2022 14:13:49 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node In-Reply-To: References: Message-ID: <6TevexWCuJcFlfnzo3ZRLWmkV7QTiZCKk_gdFXEWwv0=.0b79374a-3c34-4c20-bc90-0ca8df0acbcf@github.com> On Mon, 28 Nov 2022 18:46:11 GMT, Vladimir Kozlov wrote: >> A main loop loses its pre loop. The Opaque1 node for the zero trip >> guard of the main loop is assigned control at a Region through which >> an If is split. As a result, the Opaque1 is cloned and the zero trip >> guard takes a Phi that merges Opaque1 nodes. One of the branch dies >> next and as, a result, the zero trip guard has an Opaque1 as input but >> at the wrong CmpI input. The assert fires next. >> >> The fix I propose is that if an Opaque1 node that is part of a zero >> trip guard is encountered during split if, rather than split if up or >> down, instead, assign it the control of the zero trip guard's >> control. This way the pattern of the zero trip guard is unaffected and >> split if can proceed. I believe it's safe to assign it a later >> control: >> >> - an Opaque1 can't be shared >> >> - the zero trip guard can't be the If that's being split >> >> As Vladimir noted, this bug used to not reproduce with loop strip >> mining disabled but now always reproduces because the loop >> strip mining nest is always constructed. The reason is that the >> main loop in this test is kept alive by the LSM safepoint. If the >> LSM loop nest is not constructed, the loop is optimized out. I >> filed: >> >> https://bugs.openjdk.org/browse/JDK-8297724 >> >> for this issue. > > src/hotspot/share/opto/split_if.cpp line 242: > >> 240: set_ctrl(n, ctrl->in(0)->in(0)); >> 241: set_ctrl(cmp, ctrl->in(0)->in(0)); >> 242: set_ctrl(bol, ctrl->in(0)->in(0)); > > Why you assign control to `cmp` and `bol` too? The subgraph is` (Bool (CmpI (Opaque1...))`. So if only the control of `Opaque1` is updated then the `CmpI`/`Bool` could end up with a control that strictly dominates the control of the `Opaque1`. It's quite possible that that wouldn't break anything but wouldn't it be inconsistent and quite ugly? ------------- PR: https://git.openjdk.org/jdk/pull/11391 From qamai at openjdk.org Tue Nov 29 14:29:27 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 29 Nov 2022 14:29:27 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v13] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
> > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request incrementally with seven additional commits since the last revision: - change julong to uint64_t - uint - various fixes - add constexpr - add constexpr - add message to static_assert - missing powerOfTwo.hpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/4b359d3a..9b0f730a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=11-12 Stats: 32 lines in 6 files changed: 3 ins; 3 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Tue Nov 29 14:31:51 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 29 Nov 2022 14:31:51 GMT Subject: RFR: 8282365: Optimize divideUnsigned and remainderUnsigned for constants [v12] In-Reply-To: References: Message-ID: <9PAReV1BRXLUm7spgm6fI2nYY6v17gQume9MM-YvDh8=.d34b027b-e9e7-472d-8d8f-9abcd5f514fd@github.com> On Mon, 28 Nov 2022 12:43:04 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
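For concreteness, a scalar Java sketch of the multiply-and-shift rewrite for d = 3 (illustrative only, not the C2 code; 0x55555556 = (2**32 + 2)/3 is the usual magic constant for this case):

    // x / 3 via multiply-high and shift, the scalar analogue of the DivI
    // transform: q = floor(x * c / 2**32), plus 1 when x is negative so the
    // result rounds toward zero, matching conditions (1) and (2) above.
    static int div3(int x) {
        int q = (int) (((long) x * 0x55555556L) >> 32);
        return q + (x >>> 31);
    }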
>> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > fix build failures, move limits to new header I have followed @rose00 's suggestion by refactoring the magic constant calculation into a separate header and creating test cases for these functions. Please take a look and leave reviews. Thanks a lot. ------------- PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Tue Nov 29 14:38:57 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 29 Nov 2022 14:38:57 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v13] In-Reply-To: References: Message-ID: <955ScdreoJQ7PG5cXUmly_giKjOJx8ouU8oy1DX_GEA=.7c59dbbb-4a3b-4f35-a951-4cf0aaa6a047@github.com> > This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine codes: > > vptest xmm0, xmm1 > jb if_true > if_false: > > instead of: > > vptest xmm0, xmm1 > setb r10 > movzbl r10 > testl r10 > jne if_true > if_false: > > The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements: > > Before After > Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change > ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ? 8316.444 222279.381 ? 2660.983 ops/ms +2.3% > ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ? 1618.836 116268.691 ? 
1291.899 ops/ms +2.1% > ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ? 72.862 797.806 ? 16.429 ops/ms +13.6% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ? 2401.258 145338.910 ? 687.453 ops/ms -0.5% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ? 1259.397 69041.519 ? 1073.156 ops/ms +13.9% > ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ? 10.975 408.770 ? 5.281 ops/ms +29.0% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ? 1200.166 188482.433 ? 1872.076 ops/ms -3.7% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ? 473.013 42293.411 ? 2838.255 ops/ms -4.7% > ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ? 5.410 67.628 ? 3.241 ops/ms -0.8% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ? 1677.607 111060.400 ? 982.230 ops/ms +3.1% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ? 1002.599 21440.506 ? 1618.266 ops/ms +28.4% > ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ? 0.548 33.202 ? 2.365 ops/ms +0.7% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ? 3154.842 379944.254 ? 5703.134 ops/ms +13.3% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ? 786.312 56721.368 ? 2497.052 ops/ms -3.0% > ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ? 11.415 139.537 ? 4.667 ops/ms +4.9% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ? 2281.349 112409.365 ? 2110.055 ops/ms -4.7% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ? 795.619 33756.613 ? 826.533 ops/ms +24.7% > ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ? 8.927 66.951 ? 4.381 ops/ms +16.3% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ? 1042.497 182438.405 ? 2120.832 ops/ms -0.3% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ? 614.821 35397.398 ? 1609.235 ops/ms -3.5% > ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ? 2.142 65.427 ? 2.270 ops/ms -1.5% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ? 497.853 115165.845 ? 5381.674 ops/ms +4.3% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ? 661.350 19871.096 ? 201.464 ops/ms +35.0% > ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ? 0.821 31.933 ? 1.352 ops/ms +3.8% > > I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 30 commits: - Merge branch 'master' into improveVTest - Merge branch 'master' into improveVTest - redundant casts - remove untaken code paths - Merge branch 'master' into improveVTest - Merge branch 'master' into improveVTest - Merge branch 'master' into improveVTest - fix merge problems - Merge branch 'master' into improveVTest - refactor x86 - ... 
and 20 more: https://git.openjdk.org/jdk/compare/2f83b5c4...1fec3d30 ------------- Changes: https://git.openjdk.org/jdk/pull/9855/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9855&range=12 Stats: 494 lines in 23 files changed: 215 ins; 170 del; 109 mod Patch: https://git.openjdk.org/jdk/pull/9855.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9855/head:pull/9855 PR: https://git.openjdk.org/jdk/pull/9855 From qamai at openjdk.org Tue Nov 29 14:40:45 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 29 Nov 2022 14:40:45 GMT Subject: RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v12] In-Reply-To: <1qzngp8Z8spVxoU3C8PxQgqkCJFw3anZqp8_mn8qI2s=.2db33f71-30cf-4365-9ba6-d05146fc8771@github.com> References: <1qzngp8Z8spVxoU3C8PxQgqkCJFw3anZqp8_mn8qI2s=.2db33f71-30cf-4365-9ba6-d05146fc8771@github.com> Message-ID: On Wed, 12 Oct 2022 11:58:47 GMT, Quan Anh Mai wrote: >> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask::alltrue` is compiled into machine codes: >> >> vptest xmm0, xmm1 >> jb if_true >> if_false: >> >> instead of: >> >> vptest xmm0, xmm1 >> setb r10 >> movzbl r10 >> testl r10 >> jne if_true >> if_false: >> >> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements: >> >> Before After >> Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ? 8316.444 222279.381 ? 2660.983 ops/ms +2.3% >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ? 1618.836 116268.691 ? 1291.899 ops/ms +2.1% >> ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ? 72.862 797.806 ? 16.429 ops/ms +13.6% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ? 2401.258 145338.910 ? 687.453 ops/ms -0.5% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ? 1259.397 69041.519 ? 1073.156 ops/ms +13.9% >> ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ? 10.975 408.770 ? 5.281 ops/ms +29.0% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ? 1200.166 188482.433 ? 1872.076 ops/ms -3.7% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ? 473.013 42293.411 ? 2838.255 ops/ms -4.7% >> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ? 5.410 67.628 ? 3.241 ops/ms -0.8% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ? 1677.607 111060.400 ? 982.230 ops/ms +3.1% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ? 1002.599 21440.506 ? 1618.266 ops/ms +28.4% >> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ? 0.548 33.202 ? 2.365 ops/ms +0.7% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ? 3154.842 379944.254 ? 5703.134 ops/ms +13.3% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ? 786.312 56721.368 ? 2497.052 ops/ms -3.0% >> ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ? 11.415 139.537 ? 4.667 ops/ms +4.9% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ? 2281.349 112409.365 ? 
2110.055 ops/ms -4.7% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ? 795.619 33756.613 ? 826.533 ops/ms +24.7% >> ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ? 8.927 66.951 ? 4.381 ops/ms +16.3% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ? 1042.497 182438.405 ? 2120.832 ops/ms -0.3% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ? 614.821 35397.398 ? 1609.235 ops/ms -3.5% >> ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ? 2.142 65.427 ? 2.270 ops/ms -1.5% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ? 497.853 115165.845 ? 5381.674 ops/ms +4.3% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ? 661.350 19871.096 ? 201.464 ops/ms +35.0% >> ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ? 0.821 31.933 ? 1.352 ops/ms +3.8% >> >> I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 29 commits: > > - Merge branch 'master' into improveVTest > - redundant casts > - remove untaken code paths > - Merge branch 'master' into improveVTest > - Merge branch 'master' into improveVTest > - Merge branch 'master' into improveVTest > - fix merge problems > - Merge branch 'master' into improveVTest > - refactor x86 > - revert renaming temp > - ... and 19 more: https://git.openjdk.org/jdk/compare/86ec158d...05c1b9f5 May I have another review for this PR, please? Thank you very much. ------------- PR: https://git.openjdk.org/jdk/pull/9855 From qamai at openjdk.org Tue Nov 29 14:43:27 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 29 Nov 2022 14:43:27 GMT Subject: RFR: 8292761: x86: Clone nodes to match complex rules [v2] In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 18:16:07 GMT, Sandhya Viswanathan wrote: >> Thank @TobiHartmann @chhagedorn for your comments, I have updated the PR to address those. > > @merykitty Do you plan to get this optimization in? I had only one comment (October 18). Please take a look. @sviswa7 I did some more testing and it seems that this approach does not work reliably in cases of `blsi` and the likes. I am thinking more about this problem. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/9977 From ngasson at openjdk.org Tue Nov 29 14:56:30 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Tue, 29 Nov 2022 14:56:30 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v6] In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 13:55:36 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. 
Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Improve assembler test generation for eor3 Marked as reviewed by ngasson (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10407 From rkennke at openjdk.org Tue Nov 29 16:37:09 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 29 Nov 2022 16:37:09 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v4] In-Reply-To: References: Message-ID: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: x86_32 fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/1b4541fb..604f2a46 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=02-03 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From shade at openjdk.org Tue Nov 29 16:49:03 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 29 Nov 2022 16:49:03 GMT Subject: RFR: 8296545: C2 Blackholes should allow load optimizations [v5] In-Reply-To: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> References: <6ssL2j26EFrHwSQTBrTf5GZ__NwbMYIWmtE8oxpep_U=.8b348595-97dd-46a9-96d1-a178bee4d075@github.com> Message-ID: > If you look at generated code for the JMH benchmark like: > > > public class ArrayRead { > @Param({"1", "100", "10000", "1000000"}) > int size; > > int[] is; > > @Setup > public void setup() { > is = new int[size]; > for (int c = 0; c < size; c++) { > is[c] = c; > } > } > > @Benchmark > public void test(Blackhole bh) { > for (int i = 0; i < is.length; i++) { > bh.consume(is[i]); > } > } > } > > > ...then you would notice that the loop always re-reads `is`, `is.length`, does the range check, etc. -- all the things we would otherwise expect to be hoisted out of the loop. 
> > This is because C2 blackholes are modeled as membars that pinch both control and memory slices (like you would expect from the opaque non-inlined call), therefore every iteration has to re-read the referenced memory contents and recompute everything dependent on those loads. This behavior is not new -- the old, non-compiler blackholes were doing the same thing, accidentally -- but it was drowned in blackhole overheads. Now, these effects are clearly visible. > > We can try to do this a bit better: allow load optimizations to work across the blackholes, leaving only "prevent dead code elimination" part, as minimally required by blackhole semantics. > > Motivational improvements on the test above: > > > Benchmark (size) Mode Cnt Score Error Units > > # Before, full Java blackholes > ArrayRead.test 1 avgt 9 5.422 ? 0.023 ns/op > ArrayRead.test 100 avgt 9 460.619 ? 0.421 ns/op > ArrayRead.test 10000 avgt 9 44697.909 ? 1964.787 ns/op > ArrayRead.test 1000000 avgt 9 4332723.304 ? 2791.324 ns/op > > # Before, compiler blackholes > ArrayRead.test 1 avgt 9 1.791 ? 0.007 ns/op > ArrayRead.test 100 avgt 9 114.103 ? 1.677 ns/op > ArrayRead.test 10000 avgt 9 8528.544 ? 52.010 ns/op > ArrayRead.test 1000000 avgt 9 1005139.070 ? 2883.011 ns/op > > # After, compiler blackholes > ArrayRead.test 1 avgt 9 1.686 ? 0.006 ns/op ; ~1.1x better > ArrayRead.test 100 avgt 9 16.249 ? 0.019 ns/op ; ~7.0x better > ArrayRead.test 10000 avgt 9 1375.265 ? 2.420 ns/op ; ~6.2x better > ArrayRead.test 1000000 avgt 9 136862.574 ? 1057.100 ns/op ; ~7.3x better > > > `-prof perfasm` shows the reason for these improvements clearly: > > Before: > > > ? 0x00007f0b54498360: mov 0xc(%r12,%r10,8),%edx ; range check 1 > 7.97% ? 0x00007f0b54498365: cmp %edx,%r11d > 1.27% ? 0x00007f0b54498368: jae 0x00007f0b5449838f > ? 0x00007f0b5449836a: shl $0x3,%r10 > 0.03% ? 0x00007f0b5449836e: mov 0x10(%r10,%r11,4),%r10d ; get "is[i]" > 7.76% ? 0x00007f0b54498373: mov 0x10(%r9),%r10d ; restore "is" > 0.24% ? 0x00007f0b54498377: mov 0x3c0(%r15),%rdx ; safepoint poll, part 1 > 17.48% ? 0x00007f0b5449837e: inc %r11d ; i++ > 0.17% ? 0x00007f0b54498381: test %eax,(%rdx) ; safepoint poll, part 2 > 53.26% ? 0x00007f0b54498383: mov 0xc(%r12,%r10,8),%edx ; loop index check > 4.84% ? 0x00007f0b54498388: cmp %edx,%r11d > 0.31% ? 0x00007f0b5449838b: jl 0x00007f0b54498360 > > > After: > > > > ? 0x00007fa06c49a8b0: mov 0x2c(%rbp,%r10,4),%r9d ; stride read > 19.66% ? 0x00007fa06c49a8b5: mov 0x28(%rbp,%r10,4),%edx > 0.14% ? 0x00007fa06c49a8ba: mov 0x10(%rbp,%r10,4),%ebx > 22.09% ? 0x00007fa06c49a8bf: mov 0x14(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8c4: mov 0x18(%rbp,%r10,4),%ebx > 20.19% ? 0x00007fa06c49a8c9: mov 0x1c(%rbp,%r10,4),%ebx > 0.04% ? 0x00007fa06c49a8ce: mov 0x20(%rbp,%r10,4),%ebx > 24.02% ? 0x00007fa06c49a8d3: mov 0x24(%rbp,%r10,4),%ebx > 0.21% ? 0x00007fa06c49a8d8: add $0x8,%r10d ; i += 8 > ? 0x00007fa06c49a8dc: cmp %esi,%r10d > 0.07% ? 0x00007fa06c49a8df: jl 0x00007fa06c49a8b0 > > > Additional testing: > - [x] Eyeballing JMH Samples `-prof perfasm` > - [x] Linux x86_64 fastdebug, `compiler/blackhole`, `compiler/c2/irTests/blackhole` > - [x] Linux x86_64 fastdebug, JDK benchmark corpus Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision: - Merge branch 'master' into JDK-8296545-blackhole-effects - Add comment in cfgnode.hpp - Blackhole as CFG node - Merge branch 'master' into JDK-8296545-blackhole-effects - Blackhole should be AliasIdxTop - Do not touch memory at all - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11041/files - new: https://git.openjdk.org/jdk/pull/11041/files/06eb3d6a..49a34ed7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11041&range=03-04 Stats: 64711 lines in 1184 files changed: 29171 ins; 21225 del; 14315 mod Patch: https://git.openjdk.org/jdk/pull/11041.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11041/head:pull/11041 PR: https://git.openjdk.org/jdk/pull/11041 From kvn at openjdk.org Tue Nov 29 16:53:22 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 29 Nov 2022 16:53:22 GMT Subject: RFR: 8296924: C2: assert(is_valid_AArch64_address(dest.target())) failed: bad address [v2] In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 12:15:32 GMT, Tobias Hartmann wrote: >> With (unreachable) unsafe accesses, it can happen that the base address is invalid. On AArch64, C2 will emit a `loadConP` for loading the constant address that is implemented by [aarch64_enc_mov_p](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/aarch64.ad#L3366) calling [MacroAssembler::adrp](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L4576). The `adrp` implementation then asserts in [is_valid_AArch64_address](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp#L1321), assuming that we can only ever load constant pointers that are within the 48-bit AArch64 address space. >> >> The fix, proposed by @theRealAph, is to emit a full-blown `mov` in case of a bad address. >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/cpu/aarch64/aarch64.ad > > Co-authored-by: Andrew Haley Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/11412 From thartmann at openjdk.org Tue Nov 29 17:00:40 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 29 Nov 2022 17:00:40 GMT Subject: RFR: 8296924: C2: assert(is_valid_AArch64_address(dest.target())) failed: bad address [v2] In-Reply-To: References: Message-ID: <2QsekzVQbYu3aYgPM634lSzTw4EFeIO5iI6C1STGgqY=.11493a20-59a4-4dc4-b3d6-23c7589db715@github.com> On Tue, 29 Nov 2022 12:15:32 GMT, Tobias Hartmann wrote: >> With (unreachable) unsafe accesses, it can happen that the base address is invalid. On AArch64, C2 will emit a `loadConP` for loading the constant address that is implemented by [aarch64_enc_mov_p](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/aarch64.ad#L3366) calling [MacroAssembler::adrp](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L4576). 
The `adrp` implementation then asserts in [is_valid_AArch64_address](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp#L1321), assuming that we can only ever load constant pointers that are within the 48-bit AArch64 address space. >> >> The fix, proposed by @theRealAph, is to emit a full-blown `mov` in case of a bad address. >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/cpu/aarch64/aarch64.ad > > Co-authored-by: Andrew Haley Thanks for the review, Vladimir! ------------- PR: https://git.openjdk.org/jdk/pull/11412 From kvn at openjdk.org Tue Nov 29 17:06:27 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 29 Nov 2022 17:06:27 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node In-Reply-To: <6TevexWCuJcFlfnzo3ZRLWmkV7QTiZCKk_gdFXEWwv0=.0b79374a-3c34-4c20-bc90-0ca8df0acbcf@github.com> References: <6TevexWCuJcFlfnzo3ZRLWmkV7QTiZCKk_gdFXEWwv0=.0b79374a-3c34-4c20-bc90-0ca8df0acbcf@github.com> Message-ID: On Tue, 29 Nov 2022 14:09:42 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/split_if.cpp line 242: >> >>> 240: set_ctrl(n, ctrl->in(0)->in(0)); >>> 241: set_ctrl(cmp, ctrl->in(0)->in(0)); >>> 242: set_ctrl(bol, ctrl->in(0)->in(0)); >> >> Why you assign control to `cmp` and `bol` too? > > The subgraph is` (Bool (CmpI (Opaque1...))`. So if only the control of `Opaque1` is updated then the `CmpI`/`Bool` could end up with a control that strictly dominates the control of the `Opaque1`. It's quite possible that that wouldn't break anything but wouldn't it be inconsistent and quite ugly? I mean we never assign control edge to `Bool` and `Cmp` nodes - they depend only on their inputs. At least I don't know about it. ------------- PR: https://git.openjdk.org/jdk/pull/11391 From bkilambi at openjdk.org Tue Nov 29 17:15:28 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 29 Nov 2022 17:15:28 GMT Subject: RFR: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension [v6] In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 13:55:36 GMT, Bhavana Kilambi wrote: >> Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - >> >> eor a, a, b >> eor a, a, c >> >> can be optimized to single instruction - `eor3 a, b, c` >> >> This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - >> >> >> Benchmark gain >> TestEor3.test1Int 10.87% >> TestEor3.test1Long 8.84% >> TestEor3.test2Int 21.68% >> TestEor3.test2Long 21.04% >> >> >> The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. 
> > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Improve assembler test generation for eor3 Thank you for all the reviews. ------------- PR: https://git.openjdk.org/jdk/pull/10407 From bkilambi at openjdk.org Tue Nov 29 17:20:48 2022 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 29 Nov 2022 17:20:48 GMT Subject: Integrated: 8293488: Add EOR3 backend rule for aarch64 SHA3 extension In-Reply-To: References: Message-ID: On Fri, 23 Sep 2022 11:13:40 GMT, Bhavana Kilambi wrote: > Arm ISA v8.2A and v9.0A include SHA3 feature extensions and one of those SHA3 instructions - "eor3" performs an exclusive OR of three vectors. This is helpful in applications that have multiple, consecutive "eor" operations which can be reduced by clubbing them into fewer operations using the "eor3" instruction. For example - > > eor a, a, b > eor a, a, c > > can be optimized to single instruction - `eor3 a, b, c` > > This patch adds backend rules for Neon and SVE2 "eor3" instructions and a micro benchmark to assess the performance gains with this patch. Following are the results of the included micro benchmark on a 128-bit aarch64 machine that supports Neon, SVE2 and SHA3 features - > > > Benchmark gain > TestEor3.test1Int 10.87% > TestEor3.test1Long 8.84% > TestEor3.test2Int 21.68% > TestEor3.test2Long 21.04% > > > The numbers shown are performance gains with using Neon eor3 instruction over the master branch that uses multiple "eor" instructions instead. Similar gains can be observed with the SVE2 "eor3" version as well since the "eor3" instruction is unpredicated and the machine under test uses a maximum vector width of 128 bits which makes the SVE2 code generation very similar to the one with Neon. This pull request has now been integrated. Changeset: 54e6d6aa Author: Bhavana Kilambi Committer: Nick Gasson URL: https://git.openjdk.org/jdk/commit/54e6d6aaeb5dec2dc1b9fb3ac9b34c8621df506d Stats: 325 lines in 7 files changed: 290 ins; 0 del; 35 mod 8293488: Add EOR3 backend rule for aarch64 SHA3 extension Reviewed-by: haosun, njian, eliu, aturbanov, ngasson ------------- PR: https://git.openjdk.org/jdk/pull/10407 From rkennke at openjdk.org Tue Nov 29 18:52:52 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 29 Nov 2022 18:52:52 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v5] In-Reply-To: References: Message-ID: <671SXXcuhqudDefAvlWX1SbElyHC7liG4_Igym6UiV8=.ec68d335-6e71-4ae2-8109-12ddb823f6a0@github.com> > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. 
> > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: PPC fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/604f2a46..438f00f5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=03-04 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From dcubed at openjdk.org Tue Nov 29 22:30:15 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 29 Nov 2022 22:30:15 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> References: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> Message-ID: <1Xleu66GnSLwZ2Lmq07ahOfQ8Jqy17oG-_sQD9bicik=.b8584b47-4d74-496e-b887-456550ca2aec@github.com> On Tue, 22 Nov 2022 19:43:38 GMT, Serguei Spitsyn wrote: >> Misc stress testing related fixes: >> >> [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest >> [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode >> [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode > > This looks good. > Thanks, > Serguei @sspitsyn, @dholmes-ora, @plummercj and @lmesnik - Thanks for the reviews! Sorry for the delay in getting back to this review. I had an over abundance of CI/GK work to do before the holiday break and I just finished getting caught up after the holiday break. ------------- PR: https://git.openjdk.org/jdk/pull/11278 From dcubed at openjdk.org Tue Nov 29 22:35:22 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 29 Nov 2022 22:35:22 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 02:16:30 GMT, David Holmes wrote: >> @jonathan-gibbons - Thanks for the review! >> >> I could not find an @requires incantation for saying do-not-use-slowdebug-bits >> nor one for saying do-not-use-macosx-aarch64. I don't really do a lot with >> @requires so I could be missing something. >> >>> it's too much like brushing the dirt under the carpet. >> >> Please see the parent bugs for [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) >> and [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) and you'll >> see that I have clearly documented the failures that I've been seeing. I do plan >> to leave those bugs open, but I've gotten tired of accounting for those failures >> in my weekly stress testing runs. > >> I could not find an @requires incantation for saying do-not-use-slowdebug-bits nor one for saying do-not-use-macosx-aarch64. > > Something like: > > `@requires vm.debug != slowdebug` > `@requires !(os.arch == "aarch64" && os.family == "mac")` @dholmes-ora: > Something like: > > `@requires vm.debug != slowdebug` > `@requires !(os.arch == "aarch64" && os.family == "mac")` Thanks! I'll test these suggestions! 
------------- PR: https://git.openjdk.org/jdk/pull/11278 From dcubed at openjdk.org Tue Nov 29 22:35:25 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 29 Nov 2022 22:35:25 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 17:51:20 GMT, Leonid Mesnik wrote: >> Misc stress testing related fixes: >> >> [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest >> [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode >> [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode > > test/jdk/jdk/internal/vm/Continuation/Fuzz.java line 90: > >> 88: >> 89: public static void main(String[] args) { >> 90: if (Platform.isSlowDebugBuild() && Platform.isOSX() && Platform.isAArch64()) { > > I don't like the idea of skipping the unstable test using SkippedException. Wouldn't be better to add problemlist for slowdebug? So anyone could easy identify test bugs in slowdebug mode. Really it would be better to support bits configurations in standard problem lists like os/arch but it is a separate issue. As far as I know, the ProblemList does not support bits config so there's no way to specify an entry for 'release' or 'fastdebug' or 'slowdebug' or... ------------- PR: https://git.openjdk.org/jdk/pull/11278 From dcubed at openjdk.org Tue Nov 29 22:35:26 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 29 Nov 2022 22:35:26 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: <6tiK8h3MQgoNTHVnRtLGFJmH2HycabKnQvpE3PL413Q=.298830ac-77b9-4916-a568-10aba857b348@github.com> Message-ID: On Wed, 23 Nov 2022 00:05:54 GMT, Serguei Spitsyn wrote: >> Does langtools have its own test libraries that I can use to ask the same questions? > > Sorry, I was not clear. > The Fuzz.java has this order: > > +import jdk.test.lib.Platform; > +import jtreg.SkippedException; > > I thought, you ordered imports by names. Then it is better to keep this order unified. > It is really minor though. Sorry I'm still confused. As far as I can see, I've added the imports the same way in both Fuzz.java and TestRedirectLinks.java. And the imports are in sort order: 'jdk' comes before 'jtreg' and 'Platform' comes before 'SkippedException'. ------------- PR: https://git.openjdk.org/jdk/pull/11278 From dcubed at openjdk.org Tue Nov 29 22:48:08 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 29 Nov 2022 22:48:08 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: <1EP3SGl8MIdP1vYfZ2ZzLw1QvNZBFMmVIyTi-EGKYPg=.f82dfb7b-5d75-4eb1-a082-ffa636ccf5c8@github.com> On Wed, 23 Nov 2022 02:28:02 GMT, Chris Plummer wrote: > Do you plan on closing the CRs associated with these changes even though the root causes are not being addressed, just avoided? These two CRs are like ProblemListing bugs: [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367): disable TestRedirectLinks.java in slowdebug mode [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369): disable Fuzz.java in slowdebug mode They are both sub-tasks of the bugs that describe the slowdebug failures that I'm seeing. 
So the "disable" bugs will be closed if/when I integrate these work arounds much like a ProblemListing bug is closed when a test is added to a ProblemList. However, the parent bugs will remain open so that someone can investigate and fix these slowdebug issue in the future. > It's not clear what is meant by "Test is unstable". Is the test buggy, or are these JVM issues? I suspect that the tests have inherent assumptions about how long things take to happen and they don't happen "on time" when slowdebug bits are used. > In either case shouldn't we be trying to understand why it is unstable with slowdebug bug not fastdebug? Yes and that's why the parent bugs will still be open. This is just like ProblemListing a test when it is too noisy in the CI. However, in this case, these tests are being disabled in slowdebug configs instead of being ProblemListed for all configs. These two parent issues only affect folks running the tests in slowdebug configs when the system is heavily stressed. As far as I know, I'm the only person that regularly runs slowdebug testing... :-) ------------- PR: https://git.openjdk.org/jdk/pull/11278 From dcubed at openjdk.org Tue Nov 29 22:48:09 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Tue, 29 Nov 2022 22:48:09 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: <6Fo-8krZyzK9UwM-H6eHXopYtCX4b-cmMVeJUcgIwZI=.8f7b3ef2-708b-4ab7-b260-6bb50d24cd3e@github.com> On Tue, 22 Nov 2022 23:17:24 GMT, Jonathan Gibbons wrote: >> Misc stress testing related fixes: >> >> [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest >> [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode >> [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode > > I accept the javadoc change but dislike the general methodology: it's too much like brushing the dirt under the carpet. > > In general, I think it is better to use keywords, or `@require` to mark such tests and then (if using keywords) use command-line options to filter out such tests. @jonathan-gibbons, @sspitsyn, @dholmes-ora, @plummercj and @lmesnik - I think I've replied to all of the comments made so far. I still have to checkout the suggestion that @dholmes-ora made so I may be updating this PR again. Please let me know if these replies are acceptable to you. ------------- PR: https://git.openjdk.org/jdk/pull/11278 From lmesnik at openjdk.org Tue Nov 29 22:48:12 2022 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Tue, 29 Nov 2022 22:48:12 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 22:27:46 GMT, Daniel D. Daugherty wrote: >> test/jdk/jdk/internal/vm/Continuation/Fuzz.java line 90: >> >>> 88: >>> 89: public static void main(String[] args) { >>> 90: if (Platform.isSlowDebugBuild() && Platform.isOSX() && Platform.isAArch64()) { >> >> I don't like the idea of skipping the unstable test using SkippedException. Wouldn't be better to add problemlist for slowdebug? So anyone could easy identify test bugs in slowdebug mode. Really it would be better to support bits configurations in standard problem lists like os/arch but it is a separate issue. 
> > As far as I know, the ProblemList does not support bits config so there's no way > to specify an entry for 'release' or 'fastdebug' or 'slowdebug' or... Yes, it is needed to make a separate problem list for this and use it in your testing. The SkippedException and '@requires' are used to filter out the test when it is not applicable for this configuration, not when there is a bug that reproduced only with this configuration. Adding '@requires' usually means that we are not planning to run. If you want to add them as exception might be it makes sense to add a corresponding comment. ------------- PR: https://git.openjdk.org/jdk/pull/11278 From lmesnik at openjdk.org Tue Nov 29 23:21:31 2022 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Tue, 29 Nov 2022 23:21:31 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: <9apDpYbebd860rW6JwFxqGvRPLxEQuQk799aW7HSrPM=.7403aac2-b7ef-404c-8e67-a63ebf75a165@github.com> On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. Daugherty wrote: > Misc stress testing related fixes: > > [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest > [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode > [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode The suggested approach is good enough for me. The comments and bugs to fix/investigate slowdebug behaviour are welcome! ------------- PR: https://git.openjdk.org/jdk/pull/11278 From xlinzheng at openjdk.org Wed Nov 30 04:06:19 2022 From: xlinzheng at openjdk.org (Xiaolin Zheng) Date: Wed, 30 Nov 2022 04:06:19 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v5] In-Reply-To: <671SXXcuhqudDefAvlWX1SbElyHC7liG4_Igym6UiV8=.ec68d335-6e71-4ae2-8109-12ddb823f6a0@github.com> References: <671SXXcuhqudDefAvlWX1SbElyHC7liG4_Igym6UiV8=.ec68d335-6e71-4ae2-8109-12ddb823f6a0@github.com> Message-ID: On Tue, 29 Nov 2022 18:52:52 GMT, Roman Kennke wrote: >> Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. >> >> Testing: >> - [x] tier1 (x86_64, x86_32, aarch64) >> - [x] tier2 (x86_64, x86_32, aarch64) >> - [x] tier3 (x86_64, x86_32, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > PPC fix The diff has passed tier1. I am testing for more tiers. [riscv-11188.txt](https://github.com/openjdk/jdk/files/10119278/riscv-11188.txt) ------------- PR: https://git.openjdk.org/jdk/pull/11188 From cjplummer at openjdk.org Wed Nov 30 05:12:16 2022 From: cjplummer at openjdk.org (Chris Plummer) Date: Wed, 30 Nov 2022 05:12:16 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. 
Daugherty wrote: > Misc stress testing related fixes: > > [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest > [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode > [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode Marked as reviewed by cjplummer (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11278 From thartmann at openjdk.org Wed Nov 30 06:42:24 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 30 Nov 2022 06:42:24 GMT Subject: Integrated: 8296924: C2: assert(is_valid_AArch64_address(dest.target())) failed: bad address In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 11:55:51 GMT, Tobias Hartmann wrote: > With (unreachable) unsafe accesses, it can happen that the base address is invalid. On AArch64, C2 will emit a `loadConP` for loading the constant address that is implemented by [aarch64_enc_mov_p](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/aarch64.ad#L3366) calling [MacroAssembler::adrp](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L4576). The `adrp` implementation then asserts in [is_valid_AArch64_address](https://github.com/openjdk/jdk/blob/48017b1d9c3a7867984f54d61f17c7f034d213f5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp#L1321), assuming that we can only ever load constant pointers that are within the 48-bit AArch64 address space. > > The fix, proposed by @theRealAph, is to emit a full-blown `mov` in case of a bad address. > > Thanks, > Tobias This pull request has now been integrated. Changeset: abe532a8 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/abe532a89cbdd2b959789611cecbad7c94f6a870 Stats: 59 lines in 2 files changed: 58 ins; 0 del; 1 mod 8296924: C2: assert(is_valid_AArch64_address(dest.target())) failed: bad address Co-authored-by: Andrew Haley Reviewed-by: aph, kvn ------------- PR: https://git.openjdk.org/jdk/pull/11412 From dzhang at openjdk.org Wed Nov 30 07:10:17 2022 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 30 Nov 2022 07:10:17 GMT Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 13:08:00 GMT, Dingli Zhang wrote: > Can you also run whole tier2 please ? Hi @VladimirKempik I've run tier2 and updated the test status at the top of the page. ------------- PR: https://git.openjdk.org/jdk/pull/11344 From thartmann at openjdk.org Wed Nov 30 08:43:29 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 30 Nov 2022 08:43:29 GMT Subject: RFR: 8297689: Fix incorrect result of Short.reverseBytes() call in loops In-Reply-To: References: Message-ID: <8pV4gvVPCq8hnGreDD3Ex80UjgkTqd1PgePPH6zAUqQ=.309fbc0b-3486-41d9-9681-220c9eb51f7e@github.com> On Wed, 30 Nov 2022 07:20:11 GMT, Pengfei Li wrote: > Recently, we find calling `Short.reverseBytes()` in loops may generate incorrect result if the code is compiled by C2. Below is a simple case to reproduce. 
> > > class Foo { > static final int SIZE = 50; > static int a[] = new int[SIZE]; > > static void test() { > for (int i = 0; i < SIZE; i++) { > a[i] = Short.reverseBytes((short) a[i]); > } > } > > public static void main(String[] args) throws Exception { > Class.forName("java.lang.Short"); > a[25] = 16; > test(); > System.out.println(a[25]); > } > } > > // $ java -Xint Foo > // 4096 > // $ java -Xcomp -XX:-TieredCompilation -XX:CompileOnly=Foo.test Foo > // 268435456 > > > In this case, the `reverseBytes()` call is intrinsified and transformed into a `ReverseBytesS` node. But then C2 compiler incorrectly vectorizes it into `ReverseBytesV` with int type. C2 `Op_ReverseBytes*` has short, char, int and long versions. Their behaviors are different for different data sizes. In superword, subword operation itself doesn't have precise data size info. Instead, the data size info comes from memory operations in its use-def chain. Hence, vectorization of `reverseBytes()` is valid only if the data size is consistent with the type size of the caller's class. But current C2 compiler code lacks fine-grained type checks for `ReverseBytes*` in vector transformation. It results in `reverseBytes()` call from Short or Character class with int load/store gets vectorized incorrectly in above case. > > To fix the issue, this patch adds more checks in `VectorNode::opcode()`. T_BYTE is a special case for `Op_ReverseBytes*`. As the Java Byte class doesn't have `reverseBytes()` method so there's no `Op_ReverseBytesB`. But T_BYTE may still appear in VectorAPI calls. In this patch we still use `Op_ReverseBytesI` for T_BYTE to ensure vector intrinsification succeeds. > > Tested with hotspot::hotspot_all_no_apps, jdk tier1~3 and langtools tier1 on x86 and AArch64, no issue is found. This looks reasonable to me but I'm not an expert in that code. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/11427 From yadongwang at openjdk.org Wed Nov 30 09:46:19 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Wed, 30 Nov 2022 09:46:19 GMT Subject: RFR: 8297715: RISC-V: C2: Use single-bit instructions from the Zbs extension In-Reply-To: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> References: <6sS3mj04mdeRTLoLgicm1C3g0F1nBigke74p1XinQ4U=.9f9f80b2-774e-49d2-8659-225e79b1f4f5@github.com> Message-ID: On Tue, 29 Nov 2022 03:35:12 GMT, Fei Yang wrote: > The single-bit instructions from the Zbs extension provide a mechanism to set, clear, > invert, or extract a single bit in a register. The bit is specified by its index. > > Especially, the single-bit extract (immediate) instruction 'bexti rd, rs1, shamt' [1] performs: > > let index = shamt & (XLEN - 1); > X(rd) = (X(rs1) >> index) & 1; > > > This instruction is a perfect match for following C2 sub-graph when integer immediate 'mask' is power of 2: > > Set dst (Conv2B (AndI src mask)) > > > The effect is that we could then optimize C2 JIT code for methods like [2]: > Before: > > lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > andi R7, R28, #8 #@andI_reg_imm > snez R10, R7 #@convI2Bool > > > After: > > lhu R28, [R11, #12] # short, #@loadUS ! Field: com/sun/org/apache/xerces/internal/dom/NodeImpl.flags > bexti R10, R28, 3 # > > > Testing: Tier1-3 hotspot & jdk tested with QEMU (JTREG="VM_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+UseZbs"). 
> > [1] https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/insns/bexti.adoc > > [2] https://github.com/openjdk/jdk/blob/master/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/dom/NodeImpl.java#L1936 lgtm ------------- Marked as reviewed by yadongwang (Author). PR: https://git.openjdk.org/jdk/pull/11406 From roland at openjdk.org Wed Nov 30 10:19:42 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 30 Nov 2022 10:19:42 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v2] In-Reply-To: References: Message-ID: > A main loop loses its pre loop. The Opaque1 node for the zero trip > guard of the main loop is assigned control at a Region through which > an If is split. As a result, the Opaque1 is cloned and the zero trip > guard takes a Phi that merges Opaque1 nodes. One of the branch dies > next and as, a result, the zero trip guard has an Opaque1 as input but > at the wrong CmpI input. The assert fires next. > > The fix I propose is that if an Opaque1 node that is part of a zero > trip guard is encountered during split if, rather than split if up or > down, instead, assign it the control of the zero trip guard's > control. This way the pattern of the zero trip guard is unaffected and > split if can proceed. I believe it's safe to assign it a later > control: > > - an Opaque1 can't be shared > > - the zero trip guard can't be the If that's being split > > As Vladimir noted, this bug used to not reproduce with loop strip > mining disabled but now always reproduces because the loop > strip mining nest is always constructed. The reason is that the > main loop in this test is kept alive by the LSM safepoint. If the > LSM loop nest is not constructed, the loop is optimized out. I > filed: > > https://bugs.openjdk.org/browse/JDK-8297724 > > for this issue. Roland Westrelin has updated the pull request incrementally with three additional commits since the last revision: - more - more - review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11391/files - new: https://git.openjdk.org/jdk/pull/11391/files/7d1e79cb..26a002f5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11391&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11391&range=00-01 Stats: 117 lines in 9 files changed: 8 ins; 89 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/11391.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11391/head:pull/11391 PR: https://git.openjdk.org/jdk/pull/11391 From roland at openjdk.org Wed Nov 30 10:56:20 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 30 Nov 2022 10:56:20 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 14:06:04 GMT, Roland Westrelin wrote: > > General question. Will it help (simplify changes) if we add specialized `class OpaqueZeroTripGuardNode : public Opaque1Node` class? > > Good suggestion! Let me give it a try. I updated the patch with an OpaqueZeroTripGuardNode. ------------- PR: https://git.openjdk.org/jdk/pull/11391 From roland at openjdk.org Wed Nov 30 10:56:20 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 30 Nov 2022 10:56:20 GMT Subject: RFR: 8269820: C2 PhaseIdealLoop::do_unroll get wrong opaque node [v2] In-Reply-To: References: <6TevexWCuJcFlfnzo3ZRLWmkV7QTiZCKk_gdFXEWwv0=.0b79374a-3c34-4c20-bc90-0ca8df0acbcf@github.com> Message-ID: On Tue, 29 Nov 2022 17:03:53 GMT, Vladimir Kozlov wrote: >> The subgraph is` (Bool (CmpI (Opaque1...))`. 
So if only the control of `Opaque1` is updated then the `CmpI`/`Bool` could end up with a control that strictly dominates the control of the `Opaque1`. It's quite possible that that wouldn't break anything but wouldn't it be inconsistent and quite ugly? > > I mean we never assign control edge to `Bool` and `Cmp` nodes - they depend only on their inputs. At least I don't know about it. `set_ctrl()` doesn't change the control input of the nodes, right? It only updates the current loop opts pass's table of controls and all data nodes are in that table. I'm confused by what could be wrong here. ------------- PR: https://git.openjdk.org/jdk/pull/11391 From ihse at openjdk.org Wed Nov 30 13:00:41 2022 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 30 Nov 2022 13:00:41 GMT Subject: Integrated: 8297644: RISC-V: Compilation error when shenandoah is disabled In-Reply-To: References: Message-ID: On Fri, 25 Nov 2022 15:12:01 GMT, Magnus Ihse Bursie wrote: > If configuring with `--disable-jvm-feature-shenandoahgc`, the risc-v port fails to build. > > It seems that the code is really dependent on two header files, that is not declared, and probably has "leaked in" somewhere, but only if shenandoah is enabled. I have tried to resolve it to the best of my knowledge, but if you're not happy with the solution, by all means suggest a better way or take over this bug. This pull request has now been integrated. Changeset: 4d730f56 Author: Magnus Ihse Bursie URL: https://git.openjdk.org/jdk/commit/4d730f561fc493a956386b053de492933933ff54 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8297644: RISC-V: Compilation error when shenandoah is disabled Reviewed-by: fyang, yadongwang ------------- PR: https://git.openjdk.org/jdk/pull/11370 From ayang at openjdk.org Wed Nov 30 14:44:24 2022 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Wed, 30 Nov 2022 14:44:24 GMT Subject: RFR: 8297487: G1 Remark: no need to keep alive oop constants of nmethods on stack In-Reply-To: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> References: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> Message-ID: On Wed, 23 Nov 2022 10:05:56 GMT, Richard Reingruber wrote: > This pr removes the stackwalks to keep alive oops of nmethods found on stack during G1 remark as it seems redundant. The oops are already kept alive by the [nmethod entry barrier](https://github.com/openjdk/jdk/blob/f26bd4e0e8b68de297a9ff93526cd7fac8668320/src/hotspot/share/gc/shared/barrierSetNMethod.cpp#L85) > > Additionally it fixes a comment that says nmethod entry barriers are needed to deal with continuations which, afaik, is not the case. Please correct me and explain if I'm mistaken. > > Testing: the patch is included in our daily CI testing since a week. That is most JCK and JTREG tests, also in Xcomp mode, Renaissance benchmark and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. There was no failure I could attribute to this change. > > I tried to find a jtreg test that is sensitive to the keep alive by omitting it in the nmethod entry barrier and also in G1 remark but without success. Marked as reviewed by ayang (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/11314 From gcao at openjdk.org Wed Nov 30 14:56:22 2022 From: gcao at openjdk.org (Gui Cao) Date: Wed, 30 Nov 2022 14:56:22 GMT Subject: RFR: 8297549: RISC-V: Add support for Vector API vector load const operation In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 05:40:12 GMT, Dingli Zhang wrote: > The instruction which is matched `VectorLoadConst` will create index starting from 0 and incremented by 1. In detail, the instruction populates the destination vector by setting the first element to 0 and monotonically incrementing the value by 1 for each subsequent element. > > We can add support of `VectorLoadConst` for RISC-V by `vid.v` . It was implemented by referring to RVV v1.0 [1]. > > We can use the JMH test from https://github.com/openjdk/jdk/pull/10332. Tests are performed on qemu with parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. By adding the `-XX:+PrintAssembly`, the compilation log of `floatIndexVector` is as follows: > > > 120 vloadcon V2 # generate iota indices > 12c vfmul.vv V1, V2, V1 #@vmulF > 134 vfmv.v.f V2, F8 #@replicateF > 13c vfadd.vv V1, V2, V1 #@vaddF > > The above nodes match the logic of `Compute indexes with "vec + iota * scale"` in https://github.com/openjdk/jdk/pull/10332, which is the operation corresponding to `addIndex` in benchmark: > https://github.com/openjdk/jdk/blob/d6102110e1b48c065292db83744245a33e269cc2/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java#L92-L97 > > At the same time, the following assembly code will be generated when running the `floatIndexVector` case, there will be one more instruction than `intIndexVector`: > > 0x000000401443cc9c: .4byte 0x10072d7 > 0x000000401443cca0: .4byte 0x5208a157 > 0x000000401443cca4: .4byte 0x4a219157 > > `0x10072d7/0x5208a1d7` is the machine code for `vsetvli/vid.v` and `0x4a219157` is the additional machine code for `vfcvt.f.x.v`, which are the opcodes generated by `is_floating_point_type(bt)`: > > if (is_floating_point_type(bt)) { > __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); > } > > > After we implement these nodes, by using `-XX:+UseRVV`, the number of assembly instructions is reduced by about ~50% because of the different execution paths with the number of loops, similar to `AddTest` [3]. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java > [3] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md > > Please take a look and have some reviews. Thanks a lot. > > ## Testing: > > - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU) > - hotspot, jdk and langtools tier2 without new failures (release with UseRVV on QEMU) > - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU) LGTM, thanks ------------- Marked as reviewed by gcao (Author). 
PR: https://git.openjdk.org/jdk/pull/11344 From rrich at openjdk.org Wed Nov 30 15:12:23 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 30 Nov 2022 15:12:23 GMT Subject: RFR: 8297487: G1 Remark: no need to keep alive oop constants of nmethods on stack In-Reply-To: References: <3TOLagicixH3acqrLIpQutFoQNtrs2dlIJjEWPJs2Ow=.18c402c8-32f2-4235-a179-9796dc4ff19d@github.com> Message-ID: On Wed, 30 Nov 2022 14:40:27 GMT, Albert Mingkun Yang wrote: >> This pr removes the stackwalks to keep alive oops of nmethods found on stack during G1 remark as it seems redundant. The oops are already kept alive by the [nmethod entry barrier](https://github.com/openjdk/jdk/blob/f26bd4e0e8b68de297a9ff93526cd7fac8668320/src/hotspot/share/gc/shared/barrierSetNMethod.cpp#L85) >> >> Additionally it fixes a comment that says nmethod entry barriers are needed to deal with continuations which, afaik, is not the case. Please correct me and explain if I'm mistaken. >> >> Testing: the patch is included in our daily CI testing since a week. That is most JCK and JTREG tests, also in Xcomp mode, Renaissance benchmark and SAP specific tests with fastdebug and release builds on the standard platforms plus PPC64. There was no failure I could attribute to this change. >> >> I tried to find a jtreg test that is sensitive to the keep alive by omitting it in the nmethod entry barrier and also in G1 remark but without success. > > Marked as reviewed by ayang (Reviewer). Thanks for the review @albertnetymk ------------- PR: https://git.openjdk.org/jdk/pull/11314 From rkennke at openjdk.org Wed Nov 30 17:51:31 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 30 Nov 2022 17:51:31 GMT Subject: RFR: 8297036: Generalize C2 stub mechanism [v6] In-Reply-To: References: Message-ID: > Currently, we have two implementations of out-of-line stubs in C2, one for safepoint poll stubs (C2SafepointPollStubTable in output.hpp) and another for nmmethod entry barriers (C2EntryBarrierStubTable in output.hpp). I will need a few more for Lilliput: One for checking lock-stack size in method prologue, one for handling lock failures (both for fast-locking), and another one for load-klass slow-path. It would be good to generalize the mechanism and consolidate the existing uses on the new general mechanism. > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) > - [x] tier3 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: More RISCV fixes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11188/files - new: https://git.openjdk.org/jdk/pull/11188/files/438f00f5..cdedf273 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11188&range=04-05 Stats: 8 lines in 1 file changed: 8 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11188.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11188/head:pull/11188 PR: https://git.openjdk.org/jdk/pull/11188 From lmesnik at openjdk.org Wed Nov 30 18:01:15 2022 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Wed, 30 Nov 2022 18:01:15 GMT Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. 
On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. Daugherty wrote:

> Misc stress testing related fixes:
> 
> [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest
> [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode
> [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode

Marked as reviewed by lmesnik (Reviewer).

-------------

PR: https://git.openjdk.org/jdk/pull/11278

From dcubed at openjdk.org  Wed Nov 30 22:06:58 2022
From: dcubed at openjdk.org (Daniel D. Daugherty)
Date: Wed, 30 Nov 2022 22:06:58 GMT
Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest
In-Reply-To: 
References: 
Message-ID: 

On Wed, 23 Nov 2022 02:16:30 GMT, David Holmes wrote:

>> @jonathan-gibbons - Thanks for the review!
>> 
>> I could not find an @requires incantation for saying do-not-use-slowdebug-bits
>> nor one for saying do-not-use-macosx-aarch64. I don't really do a lot with
>> @requires so I could be missing something.
>> 
>>> it's too much like brushing the dirt under the carpet.
>> 
>> Please see the parent bugs for [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367)
>> and [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) and you'll
>> see that I have clearly documented the failures that I've been seeing. I do plan
>> to leave those bugs open, but I've gotten tired of accounting for those failures
>> in my weekly stress testing runs.
> 
>> I could not find an @requires incantation for saying do-not-use-slowdebug-bits nor one for saying do-not-use-macosx-aarch64.
> 
> Something like:
> 
> `@requires vm.debug != slowdebug`
> `@requires !(os.arch == "aarch64" && os.family == "mac")`

@dholmes-ora:
> Something like:
> 
> `@requires vm.debug != slowdebug`
> `@requires !(os.arch == "aarch64" && os.family == "mac")`

A change like this:

    @@ -25,6 +25,7 @@
      * @test
      * @bug 8190312
      * @summary test redirected URLs for -link
    + * @requires (vm.debug != slowdebug)
      * @library /tools/lib ../../lib
      * @modules jdk.compiler/com.sun.tools.javac.api
      *          jdk.compiler/com.sun.tools.javac.main

results in a complaint from jtreg like this:

    test result: Error. Parse Exception: Syntax error in @requires expression: invalid name: slowdebug

`(vm.debug == true)`
`(vm.debug == false)`

both work, as does just plain `vm.debug`, so `vm.debug` is a boolean and not a string.

-------------

PR: https://git.openjdk.org/jdk/pull/11278

From dcubed at openjdk.org  Wed Nov 30 22:10:50 2022
From: dcubed at openjdk.org (Daniel D. Daugherty)
Date: Wed, 30 Nov 2022 22:10:50 GMT
Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest
In-Reply-To: 
References: 
Message-ID: <53gBxlZnpaG0yOQRMJ4qAeRMObxFCUgldU50W7SIpJ8=.7ae48ef0-c934-4311-a990-cd412c34d26c@github.com>

On Mon, 21 Nov 2022 22:55:40 GMT, Daniel D. Daugherty wrote:

> Misc stress testing related fixes:
> 
> [JDK-8295424](https://bugs.openjdk.org/browse/JDK-8295424) adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest
> [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367) disable TestRedirectLinks.java in slowdebug mode
> [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) disable Fuzz.java in slowdebug mode

I also checked for the name `vm.slowdebug` with this:

`@requires (vm.slowdebug == false)`

and that always evaluated to `false` so the test didn't run at all...

-------------

PR: https://git.openjdk.org/jdk/pull/11278
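As an illustration of what jtreg's `@requires` grammar can express today with existing properties, a hypothetical test header could combine the boolean `vm.debug` property with string comparisons in one expression. This is not a proposed fix: the bug id, summary and test name are placeholders, and `!vm.debug` skips all debug builds, which is broader than skipping only slowdebug.

    /*
     * @test
     * @bug 0000000
     * @summary placeholder showing boolean and string properties in one @requires expression
     * @requires !vm.debug & !(os.arch == "aarch64" & os.family == "mac")
     * @run main PlaceholderTest
     */

Skipping only slowdebug builds, as discussed below, would need a property that exposes the debug level itself.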
From dcubed at openjdk.org  Wed Nov 30 22:20:33 2022
From: dcubed at openjdk.org (Daniel D. Daugherty)
Date: Wed, 30 Nov 2022 22:20:33 GMT
Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest
In-Reply-To: 
References: 
Message-ID: 

On Wed, 23 Nov 2022 02:16:30 GMT, David Holmes wrote:

>> @jonathan-gibbons - Thanks for the review!
>> 
>> I could not find an @requires incantation for saying do-not-use-slowdebug-bits
>> nor one for saying do-not-use-macosx-aarch64. I don't really do a lot with
>> @requires so I could be missing something.
>> 
>>> it's too much like brushing the dirt under the carpet.
>> 
>> Please see the parent bugs for [JDK-8297367](https://bugs.openjdk.org/browse/JDK-8297367)
>> and [JDK-8297369](https://bugs.openjdk.org/browse/JDK-8297369) and you'll
>> see that I have clearly documented the failures that I've been seeing. I do plan
>> to leave those bugs open, but I've gotten tired of accounting for those failures
>> in my weekly stress testing runs.
> 
>> I could not find an @requires incantation for saying do-not-use-slowdebug-bits nor one for saying do-not-use-macosx-aarch64.
> 
> Something like:
> 
> `@requires vm.debug != slowdebug`
> `@requires !(os.arch == "aarch64" && os.family == "mac")`

@dholmes-ora - please let me know if you are okay with these fixes since the @requires idea did not work.

-------------

PR: https://git.openjdk.org/jdk/pull/11278

From dholmes at openjdk.org  Wed Nov 30 23:58:26 2022
From: dholmes at openjdk.org (David Holmes)
Date: Wed, 30 Nov 2022 23:58:26 GMT
Subject: RFR: 8295424: adjust timeout for another JLI GetObjectSizeIntrinsicsTest.java subtest
In-Reply-To: 
References: 
Message-ID: 

On Wed, 30 Nov 2022 22:17:59 GMT, Daniel D. Daugherty wrote:

>>> I could not find an @requires incantation for saying do-not-use-slowdebug-bits nor one for saying do-not-use-macosx-aarch64.
>> 
>> Something like:
>> 
>> `@requires vm.debug != slowdebug`
>> `@requires !(os.arch == "aarch64" && os.family == "mac")`
> 
> @dholmes-ora - please let me know if you are okay with these fixes since
> the @requires idea did not work.

@dcubed-ojdk no objection from me. I was just offering what I hoped was a solution. But I see now that `vm.debug` is treated as a boolean: true for `slowdebug` and (fast)`debug`; and false otherwise. We would have to add a new property to `jtreg-ext/requires/VMProps.java` to allow checking for `slowdebug`.

-------------

PR: https://git.openjdk.org/jdk/pull/11278
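A sketch of what such a property could look like, purely as an illustration: the method name, the property key `vm.debug.level`, and the exact wiring into VMProps are assumptions, not an actual patch against `test/jtreg-ext/requires/VMProps.java`.

    // Hypothetical addition to VMProps.java so that tests could write
    // `@requires vm.debug.level != "slowdebug"`.
    protected String vmDebugLevel() {
        // "release", "fastdebug" or "slowdebug" on conventional JDK builds;
        // the "jdk.debug" system property carries the configured debug level.
        return System.getProperty("jdk.debug", "release");
    }

    // ... and, in the map of exported properties, something along the lines of:
    //     map.put("vm.debug.level", this::vmDebugLevel);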