From kbarrett at openjdk.org Tue Apr 1 09:08:24 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Apr 2025 09:08:24 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com>

Message-ID: On Mon, 31 Mar 2025 10:02:39 GMT, Magnus Ihse Bursie wrote: > I know the source code is bundled with the test image, but I'm not 100% sure if it just includes `src`, or if the entire top-level source is included. I'll need to check that, including what is the best way to get a proper reference to the top-level directory from a test. There was some discussion of this when recently adding the sources/TestNoNULL.java test. The code used here appears to similar in function (though different code) to the approach taken in that earlier test. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2768685198 From tschatzl at openjdk.org Tue Apr 1 09:24:12 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 1 Apr 2025 09:24:12 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v29] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 37 commits: - Merge branch 'master' into 8342382-card-table-instead-of-dcq - Merge branch 'master' into 8342382-card-table-instead-of-dcq - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq - * make young gen length revising independent of refinement thread * use a service task * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update - * fix IR code generation tests that change due to barrier cost changes - * factor out card table and refinement table merging into a single method - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 - * obsolete G1UpdateBufferSize G1UpdateBufferSize has previously been used to size the refinement buffers and impose a minimum limit on the number of cards per thread that need to be pending before refinement starts. The former function is now obsolete with the removal of the dirty card queues, the latter functionality has been taken over by the new diagnostic option `G1PerThreadPendingCardThreshold`. I prefer to make this a diagnostic option is better than a product option because it is something that is only necessary for some test cases to produce some otherwise unwanted behavior (continuous refinement). CSR is pending. - * more documentation on why we need to rendezvous the gc threads - Merge branch 'master' into 8342381-card-table-instead-of-dcq - ... and 27 more: https://git.openjdk.org/jdk/compare/aff5aa72...51fb6e63 ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=28 Stats: 7089 lines in 110 files changed: 2610 ins; 3555 del; 924 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From stefank at openjdk.org Tue Apr 1 11:12:37 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Tue, 1 Apr 2025 11:12:37 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: <4RCTjaaCqzo0ZjzZIIlEmWVMqQU90-j-HeuGvZAVV7M=.360d98b4-3aa7-46bc-a3cb-efdaaf12db0d@github.com> On Fri, 28 Mar 2025 22:24:40 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > convert Windows path to Unix path This looks good to me. I personally would have preferred to have the tool somewhere other than in the test directory, but I've gotten feedback from other HotSpot devs that they think its better to have the tool there. I leave the review of TEST.group to someone else. ------------- Marked as reviewed by stefank (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2732333629 From kbarrett at openjdk.org Tue Apr 1 14:34:16 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Apr 2025 14:34:16 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Fri, 28 Mar 2025 22:24:40 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > convert Windows path to Unix path test/hotspot/jtreg/TEST.groups line 142: > 140: > 141: tier1_common = \ > 142: sources \ I don't understand this change. How does this end up doing anything different than before? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2022491702 From dnsimon at openjdk.org Tue Apr 1 15:39:18 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 1 Apr 2025 15:39:18 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com>

Message-ID: On Tue, 1 Apr 2025 09:25:17 GMT, Kim Barrett wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> convert Windows path to Unix path > > test/hotspot/jtreg/TEST.groups line 142: > >> 140: >> 141: tier1_common = \ >> 142: sources \ > > I don't understand this change. How does this end up doing anything different than before? This makes `sources` be tested in GHA: https://github.com/openjdk/jdk/blob/a1ab1d8de411aace21decd133e7e74bb97f27897/.github/workflows/test.yml#L88 An alternative would be to add a separate GHA jobs just for `sources`: - test-name: 'hs/tier1 sources' test-suite: 'test/hotspot/jtreg/:tier1_sources' debug-suffix: -debug Given how small `sources` is ([currently only 1 test](https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/sources)), it felt like it should just be folded into common. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2023111780 From vpaprotski at openjdk.org Tue Apr 1 18:47:39 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Tue, 1 Apr 2025 18:47:39 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References:

Message-ID: On Sat, 22 Mar 2025 20:02:31 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Further readability improvements. > - Added asserts for array sizes src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 342: > 340: // Performs two keccak() computations in parallel. The steps of the > 341: // two computations are executed interleaved. > 342: static address generate_double_keccak(StubGenerator *stubgen, MacroAssembler *_masm) { This function seems ok. I didnt do as line-by-line 'exact' review as for the NTT intrinsics, but just put the new version into a diff next to the original function. Seems like a reasonable clean 'refactor' (hardcode the blocksize, add new input registers 10-14. Makes it really easy to spot vs 0-4 original registers..) I didnt realize before that the 'top 3 limbs' are wasted. I guess it doesnt matter, there are registers to spare aplenty and it makes the entire algorithm cleaner and easier to follow. I did also stare at the algorithm with the 'What about AVX2' question.. This function would pretty much need to be rewritten it looks like :/ Last two questions.. - how much performance is gained from doubling this function up? - If thats worth it.. what if instead it was quadrupled the input? (I scanned the java code, it looked like NR was parametrized already to 2..). It looks like there are almost enough registers here to go to 4 (I think 3 would need to be freed up somehow.. alternatively, the upper 3 limbs are empty in all operations, perhaps it could be used instead.. at the expense of readability) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2017636762 From vpaprotski at openjdk.org Tue Apr 1 18:47:39 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Tue, 1 Apr 2025 18:47:39 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v12] In-Reply-To: References:

Message-ID: On Mon, 31 Mar 2025 14:40:56 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Reacting to comments by Volodymyr. src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 359: > 357: __ kmovbl(k4, rax); > 358: __ addl(rax, 16); > 359: __ kmovbl(k5, rax); We could use the sequence from generate_sha3_implCompress to setup the K registers, that has less dependency: __ movl(rax, 0x1F); __ kmovbl(k5, rax); __ kshiftrbl(k4, k5, 1); __ kshiftrbl(k3, k5, 2); __ kshiftrbl(k2, k5, 3); __ kshiftrbl(k1, k5, 4); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2023769620 From duke at openjdk.org Wed Apr 2 07:38:34 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 2 Apr 2025 07:38:34 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v13] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Reacting to comment by Sandhya. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/7a9f6645..e4ab10bb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=11-12 Stats: 10 lines in 1 file changed: 0 ins; 4 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Wed Apr 2 07:45:14 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 2 Apr 2025 07:45:14 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v12] In-Reply-To: References:

Message-ID: <_3aVrAsKu82hHiEvG-gkLScqZrm-7M6nDo6vcA7EHds=.19728142-3151-462d-95ea-bdbc36c236a7@github.com> On Tue, 1 Apr 2025 22:43:36 GMT, Sandhya Viswanathan wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Reacting to comments by Volodymyr. > > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 359: > >> 357: __ kmovbl(k4, rax); >> 358: __ addl(rax, 16); >> 359: __ kmovbl(k5, rax); > > We could use the sequence from generate_sha3_implCompress to setup the K registers, that has less dependency: > > __ movl(rax, 0x1F); > __ kmovbl(k5, rax); > __ kshiftrbl(k4, k5, 1); > __ kshiftrbl(k3, k5, 2); > __ kshiftrbl(k2, k5, 3); > __ kshiftrbl(k1, k5, 4); Thanks! (I had copied/doubled this function from the single state version before you made me do this change on that one and I forgot to update the copy :-) ) Changed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2024255339 From duke at openjdk.org Wed Apr 2 08:22:22 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 2 Apr 2025 08:22:22 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References:

Message-ID: On Thu, 27 Mar 2025 21:42:08 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: >> >> - Further readability improvements. >> - Added asserts for array sizes > > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 342: > >> 340: // Performs two keccak() computations in parallel. The steps of the >> 341: // two computations are executed interleaved. >> 342: static address generate_double_keccak(StubGenerator *stubgen, MacroAssembler *_masm) { > > This function seems ok. I didnt do as line-by-line 'exact' review as for the NTT intrinsics, but just put the new version into a diff next to the original function. Seems like a reasonable clean 'refactor' (hardcode the blocksize, add new input registers 10-14. Makes it really easy to spot vs 0-4 original registers..) > > I didnt realize before that the 'top 3 limbs' are wasted. I guess it doesnt matter, there are registers to spare aplenty and it makes the entire algorithm cleaner and easier to follow. > > I did also stare at the algorithm with the 'What about AVX2' question.. This function would pretty much need to be rewritten it looks like :/ > > Last two questions.. > - how much performance is gained from doubling this function up? > - If thats worth it.. what if instead it was quadrupled the input? (I scanned the java code, it looked like NR was parametrized already to 2..). It looks like there are almost enough registers here to go to 4 (I think 3 would need to be freed up somehow.. alternatively, the upper 3 limbs are empty in all operations, perhaps it could be used instead.. at the expense of readability) Well, the algorithm (keccak()) is doing the same things on 5 array elements (It works on essentially a 5x5 matrix doing row and column operations, so putting 5 array entries in a vector register was the "natural" thing to do). This function can only be used under very special circumstances, which occur during the generation of tha "A matrix" in ML-KEM and ML-DSA, the speed of that matrix generation has almost doubled (I don't have exact numbers). We are using 7 registers per state and 15 for the constants, so we have only 3 to spare. We could perhaps juggle with the constants keeping just the ones that will be needed next in registers and reloading them "just in time", but that might slow things down a bit - more load instructions executed + maybe some load delay. On the other hand, more parallelism. I might try it out. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2024317665 From mchevalier at openjdk.org Wed Apr 2 14:15:00 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 2 Apr 2025 14:15:00 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References:

Message-ID: On Mon, 31 Mar 2025 11:46:57 GMT, Roland Westrelin wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> not reinventing the wheel > > src/hotspot/share/opto/memnode.cpp line 2214: > >> 2212: if (tkls->offset() == in_bytes(Klass::layout_helper_offset()) && >> 2213: tkls->isa_instklassptr() && // not directly typed as an array >> 2214: !tkls->is_instklassptr()->might_be_an_array() // not the supertype of all T[] (java.lang.Object) or has an interface that is not Serializable or Cloneable > > Could we do the same by using `TypeKlassPtr::maybe_java_subtype_of(TypeAryKlassPtr::BOTTOM)` and define a `TypeAryKlassPtr::BOTTOM` to be a static field for the `array_interfaces`? > > AFAICT, `TypeKlassPtr::maybe_java_subtype_of()` already covers that case so it would avoid some logic duplication. Also in the test above, maybe you could simplify the test a little but by removing `tkls->isa_instklassptr()`? I think it should be TypeAryKlassPtr::BOTTOM->maybe_java_subtype_of(tkls) rather than tkls->maybe_java_subtype_of(TypeAryKlassPtr::BOTTOM) My reasoning: if `TypeAryKlassPtr::BOTTOM` is `java.lang.Object + Cloneable + Serializable` any array is a subtype of that. But so is any class implementing these interfaces. As well as as any `Object` implementing more interfaces. But for these two last cases, we know they cannot be array, which is what we want to know: are we sure it's not an array, or could it be an array? But if we check if `tkls` is a supertype of `java.lang.Object + Cloneable + Serializable`, then it has to be an `Object` (the most general class) and it implements a subset of `Cloneable` and `Serializable`. In this case, it can be an array. If `tkls` is not a super-type of `java.lang.Object + Cloneable + Serializable`, there are 2 cases: - either it is an array type directly (so, I think, in a way or another, we need to check for `is_instklassptr`), and so a fortiori it can be an array type. - it's an instance type and then cannot be an array since there is nothing between array types and `java.lang.Object + Cloneable + Serializable`. I.e. there is no type `T` that is not an array type, that is a super-type of at least one array type and that is not a super-type of `java.lang.Object + Cloneable + Serializable` (that is that is not `java.lang.Object` or that implements at least another interface). In other words, our question is \exists T: T is an array type /\ T <= tkls (where `A <= B` means `A is a subtype of B`) which is equivalent to tkls >= (java.lang.Object + Cloneable + Serializable) / (tkls <= (java.lang.Object + Cloneable + Serializable) /\ tkls is an array type) We can spare the call to `is_instklassptr` by using a virtual method instead or probably other mechanisms, that's an implementation detail. But I think we need to distinguish cases: both `int[]` and `MyClass + Cloneable + Serializable + MyInterface` are sub-types of `java.lang.Object + Cloneable + Serializable` but for one, we can conclude it's definitely an array, and the other, it's definitely not. Without distinguishing cases, the only sound approximation would be to that that everything can be an array (both sub and super types of `java.lang.Object + Cloneable + Serializable`). Does that makes sense? Did I get something wrong? is the `BOTTOM` not what you had in mind? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2024918440 From mchevalier at openjdk.org Wed Apr 2 14:49:15 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 2 Apr 2025 14:49:15 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v5] In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Apply @iwanowww's refactoring ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23916/files - new: https://git.openjdk.org/jdk/pull/23916/files/80a67a55..34b3b75c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=03-04 Stats: 152 lines in 4 files changed: 71 ins; 57 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/23916.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916 PR: https://git.openjdk.org/jdk/pull/23916 From mchevalier at openjdk.org Wed Apr 2 14:49:17 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 2 Apr 2025 14:49:17 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v4] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: <1RzVI3uVrE2YscRJPUC3KeGoF5pshACXrfZX9fooPAk=.cbcc9de2-c5d4-4f8a-82f3-444f7ee7ae0a@github.com> On Mon, 31 Mar 2025 08:33:42 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code > - guess_exception_from_deopt_reason out of builtin_throw > - Use builtin_throw > - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code > - More exhaustive bench > - Limit inlining of math Exact operations in case of too many deopts I've applied the suggested refactoring. It looks fine to me, tests seems happy, microbench shows similar profile. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2772794493 From kbarrett at openjdk.org Wed Apr 2 17:11:57 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Apr 2025 17:11:57 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Fri, 28 Mar 2025 22:24:40 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > convert Windows path to Unix path Marked as reviewed by kbarrett (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2737009362 From kbarrett at openjdk.org Wed Apr 2 17:11:58 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Apr 2025 17:11:58 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com>

Message-ID: On Tue, 1 Apr 2025 15:35:45 GMT, Doug Simon wrote: >> test/hotspot/jtreg/TEST.groups line 142: >> >>> 140: >>> 141: tier1_common = \ >>> 142: sources \ >> >> I don't understand this change. How does this end up doing anything different than before? > > This makes `sources` be tested in GHA: https://github.com/openjdk/jdk/blob/a1ab1d8de411aace21decd133e7e74bb97f27897/.github/workflows/test.yml#L88 > > An alternative would be to add a separate GHA jobs just for `sources`: > > - test-name: 'hs/tier1 sources' > test-suite: 'test/hotspot/jtreg/:tier1_sources' > debug-suffix: -debug > > Given how small `sources` is ([currently only 1 test](https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/sources)), it felt like it should just be folded into common. Ah, the workflows definition is what I was having trouble finding. I understand now. In light of that, the proposed change to the groups looks fine. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2025256054 From vlivanov at openjdk.org Wed Apr 2 17:13:58 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 2 Apr 2025 17:13:58 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v5] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Wed, 2 Apr 2025 14:49:15 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Apply @iwanowww's refactoring Looks good. src/hotspot/share/opto/library_call.cpp line 2009: > 2007: if (builtin_throw_too_many_traps(Deoptimization::Reason_intrinsic, > 2008: env()->ArithmeticException_instance())) { > 2009: // It has been already too many times, but we cannot use builtin_throw care (e.g. we care about backtraces), Remove "care" in "builtin_throw care"? ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2737016248 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2025260344 From mchevalier at openjdk.org Wed Apr 2 17:23:03 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 2 Apr 2025 17:23:03 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v6] In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: fix typo in comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23916/files - new: https://git.openjdk.org/jdk/pull/23916/files/34b3b75c..238b129d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=04-05 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23916.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916 PR: https://git.openjdk.org/jdk/pull/23916 From mchevalier at openjdk.org Wed Apr 2 17:23:03 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 2 Apr 2025 17:23:03 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v5] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com>

Message-ID: On Wed, 2 Apr 2025 17:11:00 GMT, Vladimir Ivanov wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply @iwanowww's refactoring > > src/hotspot/share/opto/library_call.cpp line 2009: > >> 2007: if (builtin_throw_too_many_traps(Deoptimization::Reason_intrinsic, >> 2008: env()->ArithmeticException_instance())) { >> 2009: // It has been already too many times, but we cannot use builtin_throw care (e.g. we care about backtraces), > > Remove "care" in "builtin_throw care"? Thanks! Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2025271377 From vlivanov at openjdk.org Wed Apr 2 18:10:50 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 2 Apr 2025 18:10:50 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v6] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Wed, 2 Apr 2025 17:23:03 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > fix typo in comment Marked as reviewed by vlivanov (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2737157109 From jbhateja at openjdk.org Wed Apr 2 18:24:55 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Apr 2025 18:24:55 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v13] In-Reply-To: References:

Message-ID: On Wed, 2 Apr 2025 07:38:34 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Reacting to comment by Sandhya. @ferakocz , I verified new version of patch on Linux and windows and it works fine. Thanks for addressing my comments. ------------- Marked as reviewed by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23860#pullrequestreview-2737186292 From dnsimon at openjdk.org Wed Apr 2 22:32:58 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 2 Apr 2025 22:32:58 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Fri, 28 Mar 2025 22:24:40 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > convert Windows path to Unix path Thanks for all the discussion and reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2773874524 From dnsimon at openjdk.org Wed Apr 2 22:32:59 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 2 Apr 2025 22:32:59 GMT Subject: Integrated: 8352645: Add tool support to check order of includes In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 09:21:59 GMT, Doug Simon wrote: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... This pull request has now been integrated. Changeset: 814730ea Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/814730eae76d7b60a6082dc6f9e30618b7d8524b Stats: 486 lines in 53 files changed: 407 ins; 55 del; 24 mod 8352645: Add tool support to check order of includes Reviewed-by: stefank, kbarrett ------------- PR: https://git.openjdk.org/jdk/pull/24247 From thartmann at openjdk.org Thu Apr 3 12:01:59 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 3 Apr 2025 12:01:59 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v6] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Wed, 2 Apr 2025 17:23:03 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > fix typo in comment Took me a while to parse the code but the refactoring definitely improves the situation :slightly_smiling_face: Looks good! ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2739594615 From mchevalier at openjdk.org Thu Apr 3 13:01:15 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 3 Apr 2025 13:01:15 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v7] In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Remove useless flags in tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23916/files - new: https://git.openjdk.org/jdk/pull/23916/files/238b129d..e7c8f3e0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=05-06 Stats: 9 lines in 1 file changed: 0 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23916.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916 PR: https://git.openjdk.org/jdk/pull/23916 From mchevalier at openjdk.org Thu Apr 3 13:01:16 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 3 Apr 2025 13:01:16 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v6] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Wed, 2 Apr 2025 17:23:03 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > fix typo in comment I've made the test flags tighter as discussed offline. I'll need a fresh approval. And for completeness, there are the bench result on this last state. We can see that things behave as we expect: builtin_throw is taken and making the situation a lot better. When intrinsics or builtin_throw are disabled, we see C1-like perfs. Benchmark (SIZE) Mode Cnt Score Error Units MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.616 ? 7.813 ms/op MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 654.971 ? 573.250 ms/op MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.398 ? 0.274 ms/op MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 629.620 ? 41.181 ms/op MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 2.048 ? 0.340 ms/op MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 681.702 ? 63.721 ms/op MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 3.057 ? 13.688 ms/op MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 660.457 ? 295.393 ms/op MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 2.531 ? 13.692 ms/op MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 647.970 ? 65.451 ms/op MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 5.350 ? 25.080 ms/op MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 681.097 ? 72.604 ms/op MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.552 ? 3.145 ms/op MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 648.402 ? 62.995 ms/op MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.501 ? 0.720 ms/op MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 701.498 ? 47.948 ms/op MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.074 ? 0.949 ms/op MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 665.143 ? 537.941 ms/op MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 5.487 ? 7.165 ms/op MathExact.C1_1.loopNegateLOverflow 1000000 avgt 3 687.085 ? 20.738 ms/op MathExact.C1_1.loopSubtractIInBounds 1000000 avgt 3 1.329 ? 0.769 ms/op MathExact.C1_1.loopSubtractIOverflow 1000000 avgt 3 683.922 ? 70.434 ms/op MathExact.C1_1.loopSubtractLInBounds 1000000 avgt 3 1.384 ? 0.386 ms/op MathExact.C1_1.loopSubtractLOverflow 1000000 avgt 3 664.380 ? 480.847 ms/op MathExact.C1_2.loopAddIInBounds 1000000 avgt 3 1.862 ? 0.815 ms/op MathExact.C1_2.loopAddIOverflow 1000000 avgt 3 660.421 ? 506.723 ms/op MathExact.C1_2.loopAddLInBounds 1000000 avgt 3 1.829 ? 0.221 ms/op MathExact.C1_2.loopAddLOverflow 1000000 avgt 3 681.209 ? 78.976 ms/op MathExact.C1_2.loopDecrementIInBounds 1000000 avgt 3 3.533 ? 11.302 ms/op MathExact.C1_2.loopDecrementIOverflow 1000000 avgt 3 682.639 ? 225.392 ms/op MathExact.C1_2.loopDecrementLInBounds 1000000 avgt 3 3.402 ? 1.031 ms/op MathExact.C1_2.loopDecrementLOverflow 1000000 avgt 3 697.283 ? 306.867 ms/op MathExact.C1_2.loopIncrementIInBounds 1000000 avgt 3 3.326 ? 5.072 ms/op MathExact.C1_2.loopIncrementIOverflow 1000000 avgt 3 658.514 ? 636.731 ms/op MathExact.C1_2.loopIncrementLInBounds 1000000 avgt 3 3.718 ? 0.422 ms/op MathExact.C1_2.loopIncrementLOverflow 1000000 avgt 3 693.863 ? 49.201 ms/op MathExact.C1_2.loopMultiplyIInBounds 1000000 avgt 3 1.924 ? 2.800 ms/op MathExact.C1_2.loopMultiplyIOverflow 1000000 avgt 3 609.308 ? 94.814 ms/op MathExact.C1_2.loopMultiplyLInBounds 1000000 avgt 3 3.459 ? 0.625 ms/op MathExact.C1_2.loopMultiplyLOverflow 1000000 avgt 3 713.503 ? 556.995 ms/op MathExact.C1_2.loopNegateIInBounds 1000000 avgt 3 3.195 ? 0.726 ms/op MathExact.C1_2.loopNegateIOverflow 1000000 avgt 3 684.176 ? 27.164 ms/op MathExact.C1_2.loopNegateLInBounds 1000000 avgt 3 3.483 ? 0.947 ms/op MathExact.C1_2.loopNegateLOverflow 1000000 avgt 3 656.284 ? 582.286 ms/op MathExact.C1_2.loopSubtractIInBounds 1000000 avgt 3 1.728 ? 0.315 ms/op MathExact.C1_2.loopSubtractIOverflow 1000000 avgt 3 688.029 ? 25.201 ms/op MathExact.C1_2.loopSubtractLInBounds 1000000 avgt 3 1.941 ? 0.169 ms/op MathExact.C1_2.loopSubtractLOverflow 1000000 avgt 3 694.341 ? 339.431 ms/op MathExact.C1_3.loopAddIInBounds 1000000 avgt 3 3.122 ? 0.910 ms/op MathExact.C1_3.loopAddIOverflow 1000000 avgt 3 688.731 ? 308.210 ms/op MathExact.C1_3.loopAddLInBounds 1000000 avgt 3 5.492 ? 36.236 ms/op MathExact.C1_3.loopAddLOverflow 1000000 avgt 3 697.053 ? 229.958 ms/op MathExact.C1_3.loopDecrementIInBounds 1000000 avgt 3 9.155 ? 72.182 ms/op MathExact.C1_3.loopDecrementIOverflow 1000000 avgt 3 708.458 ? 788.701 ms/op MathExact.C1_3.loopDecrementLInBounds 1000000 avgt 3 6.402 ? 3.658 ms/op MathExact.C1_3.loopDecrementLOverflow 1000000 avgt 3 705.992 ? 213.542 ms/op MathExact.C1_3.loopIncrementIInBounds 1000000 avgt 3 7.699 ? 61.434 ms/op MathExact.C1_3.loopIncrementIOverflow 1000000 avgt 3 697.353 ? 105.457 ms/op MathExact.C1_3.loopIncrementLInBounds 1000000 avgt 3 6.380 ? 0.839 ms/op MathExact.C1_3.loopIncrementLOverflow 1000000 avgt 3 669.240 ? 522.870 ms/op MathExact.C1_3.loopMultiplyIInBounds 1000000 avgt 3 3.225 ? 0.140 ms/op MathExact.C1_3.loopMultiplyIOverflow 1000000 avgt 3 624.811 ? 457.059 ms/op MathExact.C1_3.loopMultiplyLInBounds 1000000 avgt 3 6.110 ? 1.265 ms/op MathExact.C1_3.loopMultiplyLOverflow 1000000 avgt 3 718.460 ? 68.166 ms/op MathExact.C1_3.loopNegateIInBounds 1000000 avgt 3 6.085 ? 1.430 ms/op MathExact.C1_3.loopNegateIOverflow 1000000 avgt 3 675.036 ? 341.177 ms/op MathExact.C1_3.loopNegateLInBounds 1000000 avgt 3 9.410 ? 93.522 ms/op MathExact.C1_3.loopNegateLOverflow 1000000 avgt 3 652.042 ? 166.119 ms/op MathExact.C1_3.loopSubtractIInBounds 1000000 avgt 3 3.432 ? 11.899 ms/op MathExact.C1_3.loopSubtractIOverflow 1000000 avgt 3 654.208 ? 120.258 ms/op MathExact.C1_3.loopSubtractLInBounds 1000000 avgt 3 5.166 ? 38.529 ms/op MathExact.C1_3.loopSubtractLOverflow 1000000 avgt 3 691.094 ? 80.676 ms/op MathExact.C2.loopAddIInBounds 1000000 avgt 3 2.276 ? 1.750 ms/op MathExact.C2.loopAddIOverflow 1000000 avgt 3 1.173 ? 1.392 ms/op MathExact.C2.loopAddLInBounds 1000000 avgt 3 0.985 ? 0.167 ms/op MathExact.C2.loopAddLOverflow 1000000 avgt 3 1.990 ? 5.310 ms/op MathExact.C2.loopDecrementIInBounds 1000000 avgt 3 2.072 ? 0.173 ms/op MathExact.C2.loopDecrementIOverflow 1000000 avgt 3 1.911 ? 0.288 ms/op MathExact.C2.loopDecrementLInBounds 1000000 avgt 3 1.845 ? 0.424 ms/op MathExact.C2.loopDecrementLOverflow 1000000 avgt 3 2.757 ? 27.268 ms/op MathExact.C2.loopIncrementIInBounds 1000000 avgt 3 2.136 ? 0.517 ms/op MathExact.C2.loopIncrementIOverflow 1000000 avgt 3 2.199 ? 4.024 ms/op MathExact.C2.loopIncrementLInBounds 1000000 avgt 3 1.957 ? 0.365 ms/op MathExact.C2.loopIncrementLOverflow 1000000 avgt 3 2.053 ? 0.779 ms/op MathExact.C2.loopMultiplyIInBounds 1000000 avgt 3 1.174 ? 0.941 ms/op MathExact.C2.loopMultiplyIOverflow 1000000 avgt 3 1.971 ? 10.040 ms/op MathExact.C2.loopMultiplyLInBounds 1000000 avgt 3 0.997 ? 0.318 ms/op MathExact.C2.loopMultiplyLOverflow 1000000 avgt 3 2.847 ? 4.548 ms/op MathExact.C2.loopNegateIInBounds 1000000 avgt 3 4.783 ? 2.454 ms/op MathExact.C2.loopNegateIOverflow 1000000 avgt 3 1.915 ? 0.009 ms/op MathExact.C2.loopNegateLInBounds 1000000 avgt 3 2.824 ? 28.297 ms/op MathExact.C2.loopNegateLOverflow 1000000 avgt 3 4.766 ? 32.627 ms/op MathExact.C2.loopSubtractIInBounds 1000000 avgt 3 0.990 ? 0.264 ms/op MathExact.C2.loopSubtractIOverflow 1000000 avgt 3 1.181 ? 2.120 ms/op MathExact.C2.loopSubtractLInBounds 1000000 avgt 3 2.363 ? 1.575 ms/op MathExact.C2.loopSubtractLOverflow 1000000 avgt 3 2.429 ? 7.120 ms/op MathExact.C2_no_builtin_throw.loopAddIInBounds 1000000 avgt 3 1.040 ? 0.181 ms/op MathExact.C2_no_builtin_throw.loopAddIOverflow 1000000 avgt 3 580.950 ? 112.050 ms/op MathExact.C2_no_builtin_throw.loopAddLInBounds 1000000 avgt 3 1.223 ? 5.700 ms/op MathExact.C2_no_builtin_throw.loopAddLOverflow 1000000 avgt 3 585.712 ? 61.699 ms/op MathExact.C2_no_builtin_throw.loopDecrementIInBounds 1000000 avgt 3 2.114 ? 0.663 ms/op MathExact.C2_no_builtin_throw.loopDecrementIOverflow 1000000 avgt 3 604.866 ? 578.502 ms/op MathExact.C2_no_builtin_throw.loopDecrementLInBounds 1000000 avgt 3 2.167 ? 9.268 ms/op MathExact.C2_no_builtin_throw.loopDecrementLOverflow 1000000 avgt 3 621.175 ? 225.858 ms/op MathExact.C2_no_builtin_throw.loopIncrementIInBounds 1000000 avgt 3 1.950 ? 0.326 ms/op MathExact.C2_no_builtin_throw.loopIncrementIOverflow 1000000 avgt 3 633.735 ? 830.255 ms/op MathExact.C2_no_builtin_throw.loopIncrementLInBounds 1000000 avgt 3 2.397 ? 11.911 ms/op MathExact.C2_no_builtin_throw.loopIncrementLOverflow 1000000 avgt 3 627.599 ? 141.709 ms/op MathExact.C2_no_builtin_throw.loopMultiplyIInBounds 1000000 avgt 3 1.167 ? 1.187 ms/op MathExact.C2_no_builtin_throw.loopMultiplyIOverflow 1000000 avgt 3 623.224 ? 298.374 ms/op MathExact.C2_no_builtin_throw.loopMultiplyLInBounds 1000000 avgt 3 0.944 ? 0.743 ms/op MathExact.C2_no_builtin_throw.loopMultiplyLOverflow 1000000 avgt 3 658.380 ? 137.021 ms/op MathExact.C2_no_builtin_throw.loopNegateIInBounds 1000000 avgt 3 2.119 ? 0.642 ms/op MathExact.C2_no_builtin_throw.loopNegateIOverflow 1000000 avgt 3 643.102 ? 452.213 ms/op MathExact.C2_no_builtin_throw.loopNegateLInBounds 1000000 avgt 3 2.036 ? 0.862 ms/op MathExact.C2_no_builtin_throw.loopNegateLOverflow 1000000 avgt 3 586.103 ? 26.173 ms/op MathExact.C2_no_builtin_throw.loopSubtractIInBounds 1000000 avgt 3 2.552 ? 3.677 ms/op MathExact.C2_no_builtin_throw.loopSubtractIOverflow 1000000 avgt 3 635.294 ? 217.034 ms/op MathExact.C2_no_builtin_throw.loopSubtractLInBounds 1000000 avgt 3 1.093 ? 1.685 ms/op MathExact.C2_no_builtin_throw.loopSubtractLOverflow 1000000 avgt 3 661.541 ? 1358.199 ms/op MathExact.C2_no_intrinsics.loopAddIInBounds 1000000 avgt 3 2.185 ? 15.103 ms/op MathExact.C2_no_intrinsics.loopAddIOverflow 1000000 avgt 3 831.812 ? 1260.546 ms/op MathExact.C2_no_intrinsics.loopAddLInBounds 1000000 avgt 3 2.145 ? 0.088 ms/op MathExact.C2_no_intrinsics.loopAddLOverflow 1000000 avgt 3 709.930 ? 658.722 ms/op MathExact.C2_no_intrinsics.loopDecrementIInBounds 1000000 avgt 3 2.288 ? 0.950 ms/op MathExact.C2_no_intrinsics.loopDecrementIOverflow 1000000 avgt 3 646.879 ? 186.231 ms/op MathExact.C2_no_intrinsics.loopDecrementLInBounds 1000000 avgt 3 1.894 ? 0.421 ms/op MathExact.C2_no_intrinsics.loopDecrementLOverflow 1000000 avgt 3 641.577 ? 323.040 ms/op MathExact.C2_no_intrinsics.loopIncrementIInBounds 1000000 avgt 3 2.027 ? 0.249 ms/op MathExact.C2_no_intrinsics.loopIncrementIOverflow 1000000 avgt 3 657.092 ? 229.818 ms/op MathExact.C2_no_intrinsics.loopIncrementLInBounds 1000000 avgt 3 3.220 ? 16.992 ms/op MathExact.C2_no_intrinsics.loopIncrementLOverflow 1000000 avgt 3 603.468 ? 73.240 ms/op MathExact.C2_no_intrinsics.loopMultiplyIInBounds 1000000 avgt 3 1.295 ? 0.413 ms/op MathExact.C2_no_intrinsics.loopMultiplyIOverflow 1000000 avgt 3 593.005 ? 576.291 ms/op MathExact.C2_no_intrinsics.loopMultiplyLInBounds 1000000 avgt 3 1.093 ? 0.916 ms/op MathExact.C2_no_intrinsics.loopMultiplyLOverflow 1000000 avgt 3 618.956 ? 554.204 ms/op MathExact.C2_no_intrinsics.loopNegateIInBounds 1000000 avgt 3 2.035 ? 0.047 ms/op MathExact.C2_no_intrinsics.loopNegateIOverflow 1000000 avgt 3 650.591 ? 1248.923 ms/op MathExact.C2_no_intrinsics.loopNegateLInBounds 1000000 avgt 3 3.505 ? 20.475 ms/op MathExact.C2_no_intrinsics.loopNegateLOverflow 1000000 avgt 3 660.686 ? 201.612 ms/op MathExact.C2_no_intrinsics.loopSubtractIInBounds 1000000 avgt 3 1.109 ? 0.726 ms/op MathExact.C2_no_intrinsics.loopSubtractIOverflow 1000000 avgt 3 670.468 ? 475.269 ms/op MathExact.C2_no_intrinsics.loopSubtractLInBounds 1000000 avgt 3 1.208 ? 0.806 ms/op MathExact.C2_no_intrinsics.loopSubtractLOverflow 1000000 avgt 3 597.522 ? 32.465 ms/op ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2775707480 From roland at openjdk.org Thu Apr 3 13:06:49 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 3 Apr 2025 13:06:49 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References:

Message-ID: <5c7yEX837btOgbGnTKNn8a7hlPljZRwh0TpgZI6Ogb0=.1c7f3aed-8e8c-4efe-beed-68ea192bcb99@github.com> On Wed, 2 Apr 2025 14:11:34 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/memnode.cpp line 2214: >> >>> 2212: if (tkls->offset() == in_bytes(Klass::layout_helper_offset()) && >>> 2213: tkls->isa_instklassptr() && // not directly typed as an array >>> 2214: !tkls->is_instklassptr()->might_be_an_array() // not the supertype of all T[] (java.lang.Object) or has an interface that is not Serializable or Cloneable >> >> Could we do the same by using `TypeKlassPtr::maybe_java_subtype_of(TypeAryKlassPtr::BOTTOM)` and define a `TypeAryKlassPtr::BOTTOM` to be a static field for the `array_interfaces`? >> >> AFAICT, `TypeKlassPtr::maybe_java_subtype_of()` already covers that case so it would avoid some logic duplication. Also in the test above, maybe you could simplify the test a little but by removing `tkls->isa_instklassptr()`? > > I think it should be > > TypeAryKlassPtr::BOTTOM->maybe_java_subtype_of(tkls) > > rather than > > tkls->maybe_java_subtype_of(TypeAryKlassPtr::BOTTOM) > > > My reasoning: if `TypeAryKlassPtr::BOTTOM` is `java.lang.Object + Cloneable + Serializable` any array is a subtype of that. But so is any class implementing these interfaces. As well as as any `Object` implementing more interfaces. But for these two last cases, we know they cannot be array, which is what we want to know: are we sure it's not an array, or could it be an array? > > But if we check if `tkls` is a supertype of `java.lang.Object + Cloneable + Serializable`, then it has to be an `Object` (the most general class) and it implements a subset of `Cloneable` and `Serializable`. In this case, it can be an array. If `tkls` is not a super-type of `java.lang.Object + Cloneable + Serializable`, there are 2 cases: > - either it is an array type directly (so, I think, in a way or another, we need to check for `is_instklassptr`), and so a fortiori it can be an array type. > - it's an instance type and then cannot be an array since there is nothing between array types and `java.lang.Object + Cloneable + Serializable`. I.e. there is no type `T` that is not an array type, that is a super-type of at least one array type and that is not a super-type of `java.lang.Object + Cloneable + Serializable` (that is that is not `java.lang.Object` or that implements at least another interface). > > In other words, our question is > > \exists T: T is an array type /\ T <= tkls > > (where `A <= B` means `A is a subtype of B`) which is equivalent to > > tkls >= (java.lang.Object + Cloneable + Serializable) > / (tkls <= (java.lang.Object + Cloneable + Serializable) /\ tkls is an array type) > > > We can spare the call to `is_instklassptr` by using a virtual method instead or probably other mechanisms, that's an implementation detail. But I think we need to distinguish cases: both `int[]` and `MyClass + Cloneable + Serializable + MyInterface` are sub-types of `java.lang.Object + Cloneable + Serializable` but for one, we can conclude it's definitely an array, and the other, it's definitely not. Without distinguishing cases, the only sound approximation would be to that that everything can be an array (both sub and super types of `java.lang.Object + Cloneable + Serializable`). > > Does that makes sense? Did I get something wrong? is the `BOTTOM` not what you had in mind? Yes, what I suggested doesn't work indeed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2026954565 From thartmann at openjdk.org Thu Apr 3 13:13:51 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 3 Apr 2025 13:13:51 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v7] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: <5fCOI-cNWoRD89POiHnHraJaiy_73Hlt1xZCNGLcHrY=.aebffb74-c8e5-475e-a853-3576673d6161@github.com> On Thu, 3 Apr 2025 13:01:15 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Remove useless flags in tests Marked as reviewed by thartmann (Reviewer). Great, thank you! ------------- PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2739795916 PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2775743849 From mchevalier at openjdk.org Fri Apr 4 06:54:53 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 4 Apr 2025 06:54:53 GMT Subject: RFR: 8346989: C2: deoptimization and re-execution cycle with Math.*Exact in case of frequent overflow [v7] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Thu, 3 Apr 2025 13:01:15 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Remove useless flags in tests Thanks @iwanowww and @TobiHartmann! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2777705340 From duke at openjdk.org Fri Apr 4 06:54:53 2025 From: duke at openjdk.org (duke) Date: Fri, 4 Apr 2025 06:54:53 GMT Subject: RFR: 8346989: C2: deoptimization and re-execution cycle with Math.*Exact in case of frequent overflow [v7] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Thu, 3 Apr 2025 13:01:15 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Remove useless flags in tests @marc-chevalier Your change (at version e7c8f3e06f46e85cb3c2dc974db84b10a57bd086) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2777706873 From tschatzl at openjdk.org Fri Apr 4 08:10:34 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 4 Apr 2025 08:10:34 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: - * missing file from merge - Merge branch 'master' into 8342382-card-table-instead-of-dcq - Merge branch 'master' into 8342382-card-table-instead-of-dcq - Merge branch 'master' into 8342382-card-table-instead-of-dcq - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq - * make young gen length revising independent of refinement thread * use a service task * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update - * fix IR code generation tests that change due to barrier cost changes - * factor out card table and refinement table merging into a single method - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 - * obsolete G1UpdateBufferSize G1UpdateBufferSize has previously been used to size the refinement buffers and impose a minimum limit on the number of cards per thread that need to be pending before refinement starts. The former function is now obsolete with the removal of the dirty card queues, the latter functionality has been taken over by the new diagnostic option `G1PerThreadPendingCardThreshold`. I prefer to make this a diagnostic option is better than a product option because it is something that is only necessary for some test cases to produce some otherwise unwanted behavior (continuous refinement). CSR is pending. - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=29 Stats: 7089 lines in 110 files changed: 2610 ins; 3555 del; 924 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From sviswanathan at openjdk.org Sat Apr 5 00:44:56 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Sat, 5 Apr 2025 00:44:56 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v13] In-Reply-To: References:

Message-ID: On Sat, 5 Apr 2025 14:29:28 GMT, Zihao Lin wrote: >> This patch remove slice parameter from LoadNode::make >> >> Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 >> >> Hi team, I am new, I'd appreciate any guidance. Thank a lot! > > Zihao Lin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'openjdk:master' into 8344116 > - Fix build > - Fix test failed > - 8344116: C2: remove slice parameter from LoadNode::make Hi @TobiHartmann , Could you please take a look? Thank you. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24258#issuecomment-2781240184 From mchevalier at openjdk.org Mon Apr 7 05:24:57 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 7 Apr 2025 05:24:57 GMT Subject: Integrated: 8346989: C2: deoptimization and re-execution cycle with Math.*Exact in case of frequent overflow In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: <0131FJuGwDAfwqB3GKnlj_9xeoinsnsUjNq8LodfkZE=.9501af2c-473f-4387-b319-aac1dff8cd18@github.com> On Wed, 5 Mar 2025 12:56:48 GMT, Marc Chevalier wrote: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... This pull request has now been integrated. Changeset: 97ed5361 Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/97ed536125645304aed03a4afbc3ded627de0bb0 Stats: 845 lines in 6 files changed: 769 ins; 59 del; 17 mod 8346989: C2: deoptimization and re-execution cycle with Math.*Exact in case of frequent overflow Reviewed-by: thartmann, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/23916 From thartmann at openjdk.org Mon Apr 7 06:02:52 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 7 Apr 2025 06:02:52 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References:

Message-ID: On Sun, 23 Mar 2025 17:00:43 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Merged master. > - Fixed bad assertion. > - Fixed mismerge. > - Merged master. > - A little cleanup > - Merged master > - removing trailing spaces > - kyber aarch64 intrinsics @ferakocz Thanks for another very good piece of work which appears to me to be functioning correctly and performantly. The PR suffers from the same problems as the original ML_DSA one i.e. The mapping of data to registers and the overall structure of the generated code and its relation to the related Java code/the original algorithms will be hard for a maintainer to identify. I have reworked your patch to use vector sequences in this [draft PR](https://github.com/openjdk/jdk/pull/24419) in very much the same way as was done for the ML_DSA PR. This has significantly abstracted and clarified the register mappings that are in use in each kyber generator and has also made the higher level structure of the generated code much easier to follow. Note that my rework of the generation routines was applied to your original PR after rebasing it on master. Before updating the kyber routines I also generalized a few of the VSeq methods that benefit from being shared by both kyber and dilithium, most notably the montmul routines, and I added a few extra helpers. The reworked version passes the ML_KEM functional test and gives similar performance improvements for the ML_KEM micro benchmark. The generated code does differ in a few places from what your original patch generates but only superficially - most notable is that a few loads/stores that rely on continued post-increments in the original instead use a constant offset or an add/load pair in the reworked code. This makes a very minor difference to code size and does not seem to affect performance. I would like you to rework your PR to incorporate these changes because I believe it will make a big difference to maintainability. n.b. it may be easier to integrate my changes by diffing your branch and mine and applying the resulting change set rather than trying to merge the changes. Please let me know if you have problems with the integration and need help. I still have some further review comments and would also like to see more commenting to explain what the code is doing. However, I think it will be easier to do that after this rework has been integrated into your PR. ------------- PR Review: https://git.openjdk.org/jdk/pull/23663#pullrequestreview-2746672860 From yzheng at openjdk.org Mon Apr 7 14:27:30 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Mon, 7 Apr 2025 14:27:30 GMT Subject: RFR: 8353735: [JVMCI] Allow specifying storage kind of the callee save register Message-ID: Windows x64 ABI considers the upper portions of YMM0-YMM15 and ZMM0-ZMM15 volatile, that is, destroyed on function calls. This PR allows `RegisterConfig` implementations to refine the storage kind of callee save register, such that JVMCI compiler can exploit this information to avoid backing up full width of these registers. ------------- Commit messages: - [JVMCI] Allow specifying storage kind of the callee save register Changes: https://git.openjdk.org/jdk/pull/24451/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24451&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8353735 Stats: 8 lines in 1 file changed: 7 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/24451.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24451/head:pull/24451 PR: https://git.openjdk.org/jdk/pull/24451 From dnsimon at openjdk.org Mon Apr 7 14:49:37 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 7 Apr 2025 14:49:37 GMT Subject: RFR: 8353735: [JVMCI] Allow specifying storage kind of the callee save register In-Reply-To: References: Message-ID: On Fri, 4 Apr 2025 14:47:39 GMT, Yudi Zheng wrote: > Windows x64 ABI considers the upper portions of YMM0-YMM15 and ZMM0-ZMM15 volatile, that is, destroyed on function calls. This PR allows `RegisterConfig` implementations to refine the storage kind of callee save register, such that JVMCI compiler can exploit this information to avoid backing up full width of these registers. Marked as reviewed by dnsimon (Reviewer). src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/RegisterConfig.java line 98: > 96: > 97: /** > 98: * Gets the storage kind for a callee save register. I would add a second sentence describing the Window ABI example so that it's clear why this API exists. ------------- PR Review: https://git.openjdk.org/jdk/pull/24451#pullrequestreview-2747108590 PR Review Comment: https://git.openjdk.org/jdk/pull/24451#discussion_r2031411734 From roland at openjdk.org Mon Apr 7 15:20:16 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 7 Apr 2025 15:20:16 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References:

Message-ID: On Mon, 31 Mar 2025 06:49:50 GMT, Marc Chevalier wrote: >> If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. >> >> In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. >> >> This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. >> >> The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. >> >> Tested with tier1..3, hs-precheckin-comp and hs-comp-stress >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > not reinventing the wheel Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24245#pullrequestreview-2747218926 From cslucas at openjdk.org Mon Apr 7 19:04:15 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 7 Apr 2025 19:04:15 GMT Subject: RFR: 8353735: [JVMCI] Allow specifying storage kind of the callee save register In-Reply-To: References: Message-ID: <4o1jkl6NOwwVZ0--oMiBOOLnRzE0OIO6JazFL_gV4UU=.7d25433e-7507-49f0-afab-091bb7f2305d@github.com> On Fri, 4 Apr 2025 14:47:39 GMT, Yudi Zheng wrote: > Windows x64 ABI considers the upper portions of YMM0-YMM15 and ZMM0-ZMM15 volatile, that is, destroyed on function calls. This PR allows `RegisterConfig` implementations to refine the storage kind of callee save register, such that JVMCI compiler can exploit this information to avoid saving full width of these registers. LGTM ------------- Marked as reviewed by cslucas (Author). PR Review: https://git.openjdk.org/jdk/pull/24451#pullrequestreview-2747807421 From sviswanathan at openjdk.org Tue Apr 8 00:12:15 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 8 Apr 2025 00:12:15 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v13] In-Reply-To: References:

Message-ID: On Sat, 5 Apr 2025 14:29:28 GMT, Zihao Lin wrote: >> This patch remove slice parameter from LoadNode::make >> >> Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 >> >> Hi team, I am new, I'd appreciate any guidance. Thank a lot! > > Zihao Lin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'openjdk:master' into 8344116 > - Fix build > - Fix test failed > - 8344116: C2: remove slice parameter from LoadNode::make src/hotspot/share/gc/shared/c2/barrierSetC2.cpp line 223: > 221: MergeMemNode* mm = opt_access.mem(); > 222: PhaseGVN& gvn = opt_access.gvn(); > 223: Node* mem = mm->memory_at(gvn.C->get_alias_index(access.addr().type())); Can we get rid of all uses of `access.addr().type()`? src/hotspot/share/gc/shared/c2/cardTableBarrierSetC2.cpp line 105: > 103: // stores. In theory we could relax the load from ctrl() to > 104: // no_ctrl, but that doesn't buy much latitude. > 105: Node* card_val = __ load( __ ctrl(), card_adr, TypeInt::BYTE, T_BYTE); We could asssert that `C->get_alias_index(kit->type(card_adr) == Compile::AliasIdxRaw`, that is that computed slice is the same as hardcoded slide. Similar asserts could be added for every location where a slice/address type is removed in this patch. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24258#discussion_r2033149694 PR Review Comment: https://git.openjdk.org/jdk/pull/24258#discussion_r2033162534 From duke at openjdk.org Tue Apr 8 21:27:08 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 8 Apr 2025 21:27:08 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v14] In-Reply-To: References: Message-ID: <394Wf5RpbwUgE7zBaZBnwa2YAxQFwWDhF1VuaMPHdhE=.98ff29f7-b6a7-49eb-bdd6-8489568b24b7@github.com> > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Reacting to mor comments from Sandhya. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/e4ab10bb..0b0d0969 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=12-13 Stats: 11 lines in 1 file changed: 0 ins; 4 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Tue Apr 8 21:29:26 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 8 Apr 2025 21:29:26 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v13] In-Reply-To: References:

Message-ID: On Sat, 5 Apr 2025 00:27:05 GMT, Sandhya Viswanathan wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Reacting to comment by Sandhya. > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 345: > >> 343: >> 344: store4Xmms(coeffs, 0, xmm0_3, _masm); >> 345: store4Xmms(coeffs, 4 * XMMBYTES, xmm4_7, _masm); > > This seems to be unnecessary store. Thanks for catching that. Changed. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 370: > >> 368: loadPerm(xmm16_19, perms, nttL4PermsIdx, _masm); >> 369: loadPerm(xmm12_15, perms, nttL4PermsIdx + 64, _masm); >> 370: load4Xmms(xmm24_27, zetas, 4 * 512, _masm); // for level 3 > > The comment // for level3 is not relevant here and could be removed. Ooops. Deleted the comment. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 802: > >> 800: __ evpbroadcastd(zero, scratch, Assembler::AVX_512bit); // 0 >> 801: __ addl(scratch, 1); >> 802: __ evpbroadcastd(one, scratch, Assembler::AVX_512bit); // 1 > > A better way to initialize (0, 1, -1) vectors is: > // load 0 into int vector > vpxor(zero, zero, zero, Assembler::AVX_512bit); > // load -1 into int vector > vpternlogd(minusOne, 0xff, minusOne, minusOne, Assembler::AVX_512bit); > // load 1 into int vector > vpsubd(one, zero, minusOne, Assembler::AVX_512bit); > > Where minusOne could be xmm31. > > A broadcast from r register to xmm register is more expensive. Changed. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 982: > >> 980: __ evporq(xmm19, k0, xmm19, xmm23, false, Assembler::AVX_512bit); >> 981: >> 982: __ evpsubd(xmm12, k0, zero, one, false, Assembler::AVX_512bit); // -1 > > The -1 initialization could be done outside the loop. Not really. All registers are used. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1015: > >> 1013: __ addptr(lowPart, 4 * XMMBYTES); >> 1014: __ cmpl(len, 0); >> 1015: __ jcc(Assembler::notEqual, L_loop); > > It looks to me that subl and cmpl could be merged: > __ addptr(highPart, 4 * XMMBYTES); > __ addptr(lowPart, 4 * XMMBYTES); > __ subl(len, 4 * XMMBYTES); > __ jcc(Assembler::notEqual, L_loop); Changed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057184 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057342 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057700 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057565 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057463 From sviswanathan at openjdk.org Tue Apr 8 22:01:42 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 8 Apr 2025 22:01:42 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v14] In-Reply-To: <394Wf5RpbwUgE7zBaZBnwa2YAxQFwWDhF1VuaMPHdhE=.98ff29f7-b6a7-49eb-bdd6-8489568b24b7@github.com> References: <394Wf5RpbwUgE7zBaZBnwa2YAxQFwWDhF1VuaMPHdhE=.98ff29f7-b6a7-49eb-bdd6-8489568b24b7@github.com> Message-ID: <-W1vBCTLtPyOZNm6XhHQXT9spBbkAd4Z4rTn_LHH1Aw=.5beae719-ac8b-404a-a34c-deecfc97dd7e@github.com> On Tue, 8 Apr 2025 21:27:08 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Reacting to mor comments from Sandhya. Overall very clean and nicely done PR. Thanks a lot for considering my inputs. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23860#pullrequestreview-2751503300 From mchevalier at openjdk.org Wed Apr 9 07:20:00 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Apr 2025 07:20:00 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v3] In-Reply-To: References: Message-ID: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. > > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. > > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'master' into feat/Fold-layout-helper-check-for-objects-implementing-non-array-interfaces - Merge branch 'master' into feat/Fold-layout-helper-check-for-objects-implementing-non-array-interfaces - not reinventing the wheel - Revert now useless fix - Generalize the not-array proof ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24245/files - new: https://git.openjdk.org/jdk/pull/24245/files/daaaf9ae..b1fb82a2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24245&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24245&range=01-02 Stats: 74611 lines in 2220 files changed: 27997 ins; 41827 del; 4787 mod Patch: https://git.openjdk.org/jdk/pull/24245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24245/head:pull/24245 PR: https://git.openjdk.org/jdk/pull/24245 From mchevalier at openjdk.org Wed Apr 9 09:19:39 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Apr 2025 09:19:39 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v3] In-Reply-To: References:

Message-ID: <1kzdW30AxgS9RnVEzxrwkkQk8dwxOT79wloukL-Vz38=.8c62c7ef-10b0-4011-8744-7ae24211362c@github.com> On Wed, 9 Apr 2025 07:20:00 GMT, Marc Chevalier wrote: >> If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. >> >> In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. >> >> This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. >> >> The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. >> >> Tested with tier1..3, hs-precheckin-comp and hs-comp-stress >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into feat/Fold-layout-helper-check-for-objects-implementing-non-array-interfaces > - Merge branch 'master' into feat/Fold-layout-helper-check-for-objects-implementing-non-array-interfaces > - not reinventing the wheel > - Revert now useless fix > - Generalize the not-array proof The branch was a bit old, so I've merged master in it and run tests. It seems all good! Thanks @TobiHartmann and @rwestrel for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24245#issuecomment-2788948220 From duke at openjdk.org Wed Apr 9 09:19:40 2025 From: duke at openjdk.org (duke) Date: Wed, 9 Apr 2025 09:19:40 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v3] In-Reply-To: References:

Message-ID: On Wed, 9 Apr 2025 07:20:00 GMT, Marc Chevalier wrote: >> If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. >> >> In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. >> >> This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. >> >> The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. >> >> Tested with tier1..3, hs-precheckin-comp and hs-comp-stress >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into feat/Fold-layout-helper-check-for-objects-implementing-non-array-interfaces > - Merge branch 'master' into feat/Fold-layout-helper-check-for-objects-implementing-non-array-interfaces > - not reinventing the wheel > - Revert now useless fix > - Generalize the not-array proof @marc-chevalier Your change (at version b1fb82a28a4f6c3f126d312727cb9f89a9f51669) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24245#issuecomment-2788951877 From mchevalier at openjdk.org Wed Apr 9 09:31:49 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Apr 2025 09:31:49 GMT Subject: Integrated: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: References: Message-ID: On Wed, 26 Mar 2025 09:16:17 GMT, Marc Chevalier wrote: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. > > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. > > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc This pull request has now been integrated. Changeset: a1d566ce Author: Marc Chevalier Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/a1d566ce4b0315591ece489347c5d1c253f06be9 Stats: 34 lines in 5 files changed: 23 ins; 7 del; 4 mod 8348853: Fold layout helper check for objects implementing non-array interfaces Reviewed-by: thartmann, roland ------------- PR: https://git.openjdk.org/jdk/pull/24245 From ayang at openjdk.org Wed Apr 9 10:36:44 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Wed, 9 Apr 2025 10:36:44 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References:

Message-ID: On Fri, 4 Apr 2025 08:10:34 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: > > - * missing file from merge > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq > - * make young gen length revising independent of refinement thread > * use a service task > * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update > - * fix IR code generation tests that change due to barrier cost changes > - * factor out card table and refinement table merging into a single > method > - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 > - * obsolete G1UpdateBufferSize > > G1UpdateBufferSize has previously been used to size the refinement > buffers and impose a minimum limit on the number of cards per thread > that need to be pending before refinement starts. > > The former function is now obsolete with the removal of the dirty > card queues, the latter functionality has been taken over by the new > diagnostic option `G1PerThreadPendingCardThreshold`. > > I prefer to make this a diagnostic option is better than a product option > because it is something that is only necessary for some test cases to > produce some otherwise unwanted behavior (continuous refinement). > > CSR is pending. > - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 170: > 168: } > 169: return result; > 170: } I see in `G1ConcurrentRefineThread::do_refinement`: // The yielding may have completed the task, check. if (!state.is_in_progress()) { I wonder if it's simpler to use `is_in_progress` consistently to detect whether we should restart sweep, instead of `_sweep_start_epoch`. src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349: > 347: } > 348: > 349: bool has_sweep_rt_work = is_in_progress() && _state == State::SweepRT; Why `is_in_progress()`? src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 79: > 77: > 78: void inc_cards_scanned(size_t increment = 1) { _cards_scanned += increment; } > 79: void inc_cards_clean(size_t increment = 1) { _cards_clean += increment; } The sole caller always passes in arg, so no need for default-arg-value. src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 87: > 85: void add_atomic(G1ConcurrentRefineStats* other); > 86: > 87: G1ConcurrentRefineStats& operator+=(const G1ConcurrentRefineStats& other); Seems that these operators are not used after this PR. src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 83: > 81: break; > 82: } > 83: case G1RemSet::HasRefToOld : break; // Nothing special to do. Why doesn't call `inc_cards_clean_again` in this case? The card is cleared also. (In fact, I don't get why this needs to a separate case from `NoInteresting`.) src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 156: > 154: > 155: _refine_stats.inc_cards_scanned(claim.size()); > 156: _refine_stats.inc_cards_clean(claim.size() - scanned); I feel these two "scanned" mean sth diff; the local var should probably be sth like `num_dirty_cards`. src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 207: > 205: > 206: if (!interrupted_by_gc) { > 207: state.add_yield_duration(G1CollectedHeap::heap()->safepoint_duration() - synchronize_duration_at_sweep_start); I think this is recorded to later calculate actual refine-time, i.e. sweep-time - yield-time. However, why can't yield-duration be recorded in this refine-control-thread directly -- accumulation of `jlong yield_duration = os::elapsed_counter() - yield_start`. I feel that is easier to reason than going through g1heap. src/hotspot/share/gc/g1/g1ReviseYoungListTargetLengthTask.cpp line 75: > 73: { > 74: MutexLocker x(G1ReviseYoungLength_lock, Mutex::_no_safepoint_check_flag); > 75: G1Policy* p = g1h->policy(); Can probably use the existing `policy`. src/hotspot/share/gc/g1/g1ReviseYoungListTargetLengthTask.cpp line 88: > 86: } > 87: > 88: G1ReviseYoungLengthTargetLengthTask::G1ReviseYoungLengthTargetLengthTask(const char* name) : I wonder if the class name can be shortened a bit, sth like `G1ReviseYoungLengthTask`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033251162 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033222407 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033929489 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033975054 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033934399 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033910496 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2032008908 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2029855278 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2029855435 From rcastanedalo at openjdk.org Wed Apr 9 12:03:49 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 9 Apr 2025 12:03:49 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References:

Message-ID: <8noWoU1cd2y4EjjK3QZGMLacPC9gkrwn5Ns3XbQbppI=.74de0b05-b8da-417f-8096-de98d7a3d815@github.com> On Fri, 4 Apr 2025 08:10:34 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: > > - * missing file from merge > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq > - * make young gen length revising independent of refinement thread > * use a service task > * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update > - * fix IR code generation tests that change due to barrier cost changes > - * factor out card table and refinement table merging into a single > method > - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 > - * obsolete G1UpdateBufferSize > > G1UpdateBufferSize has previously been used to size the refinement > buffers and impose a minimum limit on the number of cards per thread > that need to be pending before refinement starts. > > The former function is now obsolete with the removal of the dirty > card queues, the latter functionality has been taken over by the new > diagnostic option `G1PerThreadPendingCardThreshold`. > > I prefer to make this a diagnostic option is better than a product option > because it is something that is only necessary for some test cases to > produce some otherwise unwanted behavior (continuous refinement). > > CSR is pending. > - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f Hi Thomas, great simplification and encouraging results! I reviewed the compiler-related parts of the changeset, including x64 and aarch64 changes. src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp line 246: > 244: __ cbz(new_val, done); > 245: } > 246: // Storing region crossing non-null, is card young? Suggestion: // Storing region crossing non-null. src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 101: > 99: } > 100: > 101: void G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, DecoratorSet decorators, Have you measured the performance impact of inlining this assembly code instead of resorting to a runtime call as done before? Is it worth the maintenance cost (for every platform), risk of introducing bugs, etc.? src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 145: > 143: > 144: __ bind(is_clean_card); > 145: // Card was clean. Dirty card and go to next.. This code seems unreachable if `!UseCondCardMark`, meaning we only dirty cards here if `UseCondCardMark` is enabled. Is that intentional? src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 319: > 317: const Register thread, > 318: const Register tmp1, > 319: const Register tmp2, Since `tmp2` is not needed in the x64 post-barrier, I suggest not passing it around for this platform, for simplicity and also to make optimization opportunities more visible in the future. Here is my suggestion: https://github.com/robcasloz/jdk/commit/855ec8df4a641f8c491c5c09acea3ee434b7e230, feel free to merge if you agree. src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 38: > 36: #include "c1/c1_LIRAssembler.hpp" > 37: #include "c1/c1_MacroAssembler.hpp" > 38: #endif // COMPILER1 I suggest removing the conditional compilation directives and grouping these includes together with the above `c1` ones. src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 147: > 145: state->do_input(_thread); > 146: > 147: // Use temp registers to ensure these they use different registers. Suggestion: // Use temps to enforce different registers. src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 307: > 305: + 6 // same region check: Uncompress (new_val) oop, xor, shr, (cmp), jmp > 306: + 4 // new_val is null check > 307: + 4; // card not clean check. It probably does not affect the unrolling heuristics too much, but you may want to make the last cost component conditional on `UseCondCardMark`. src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 396: > 394: bool needs_liveness_data(const MachNode* mach) const { > 395: return G1BarrierStubC2::needs_pre_barrier(mach) || > 396: G1BarrierStubC2::needs_post_barrier(mach); Suggestion: // Liveness data is only required to compute registers that must be // preserved across the runtime call in the pre-barrier stub. return G1BarrierStubC2::needs_pre_barrier(mach); src/hotspot/share/gc/g1/g1BarrierSet.hpp line 56: > 54: // > 55: // The refinement threads mark cards in the current collection set specially on the > 56: // card table - this is fine wrt to synchronization with the mutator, because at Suggestion: // card table - this is fine wrt synchronization with the mutator, because at test/hotspot/jtreg/compiler/gcbarriers/TestG1BarrierGeneration.java line 521: > 519: phase = CompilePhase.FINAL_CODE) > 520: @IR(counts = {IRNode.COUNTED_LOOP, "2"}, > 521: phase = CompilePhase.FINAL_CODE) I suggest to remove this extra IR check to avoid over-specifying the expected loop shape. For example, running this test with loop unrolling disabled (`-XX:LoopUnrollLimit=0`) would now fail because only one counted loop would be found. ------------- Changes requested by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23739#pullrequestreview-2753154117 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035174209 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035175921 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035177738 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035183250 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035186980 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035192666 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035210464 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035196251 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035198219 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035201056 From tschatzl at openjdk.org Wed Apr 9 12:41:40 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 9 Apr 2025 12:41:40 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: <8noWoU1cd2y4EjjK3QZGMLacPC9gkrwn5Ns3XbQbppI=.74de0b05-b8da-417f-8096-de98d7a3d815@github.com> References:

<8noWoU1cd2y4EjjK3QZGMLacPC9gkrwn5Ns3XbQbppI=.74de0b05-b8da-417f-8096-de98d7a3d815@github.com> Message-ID: On Wed, 9 Apr 2025 11:35:26 GMT, Roberto Casta?eda Lozano wrote: >> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: >> >> - * missing file from merge >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq >> - * make young gen length revising independent of refinement thread >> * use a service task >> * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update >> - * fix IR code generation tests that change due to barrier cost changes >> - * factor out card table and refinement table merging into a single >> method >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 >> - * obsolete G1UpdateBufferSize >> >> G1UpdateBufferSize has previously been used to size the refinement >> buffers and impose a minimum limit on the number of cards per thread >> that need to be pending before refinement starts. >> >> The former function is now obsolete with the removal of the dirty >> card queues, the latter functionality has been taken over by the new >> diagnostic option `G1PerThreadPendingCardThreshold`. >> >> I prefer to make this a diagnostic option is better than a product option >> because it is something that is only necessary for some test cases to >> produce some otherwise unwanted behavior (continuous refinement). >> >> CSR is pending. >> - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f > > src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 145: > >> 143: >> 144: __ bind(is_clean_card); >> 145: // Card was clean. Dirty card and go to next.. > > This code seems unreachable if `!UseCondCardMark`, meaning we only dirty cards here if `UseCondCardMark` is enabled. Is that intentional? Great find! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035280909 From tschatzl at openjdk.org Wed Apr 9 12:50:42 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 9 Apr 2025 12:50:42 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: <8noWoU1cd2y4EjjK3QZGMLacPC9gkrwn5Ns3XbQbppI=.74de0b05-b8da-417f-8096-de98d7a3d815@github.com> References:

<8noWoU1cd2y4EjjK3QZGMLacPC9gkrwn5Ns3XbQbppI=.74de0b05-b8da-417f-8096-de98d7a3d815@github.com> Message-ID: On Wed, 9 Apr 2025 11:34:09 GMT, Roberto Casta?eda Lozano wrote: >> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: >> >> - * missing file from merge >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq >> - * make young gen length revising independent of refinement thread >> * use a service task >> * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update >> - * fix IR code generation tests that change due to barrier cost changes >> - * factor out card table and refinement table merging into a single >> method >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 >> - * obsolete G1UpdateBufferSize >> >> G1UpdateBufferSize has previously been used to size the refinement >> buffers and impose a minimum limit on the number of cards per thread >> that need to be pending before refinement starts. >> >> The former function is now obsolete with the removal of the dirty >> card queues, the latter functionality has been taken over by the new >> diagnostic option `G1PerThreadPendingCardThreshold`. >> >> I prefer to make this a diagnostic option is better than a product option >> because it is something that is only necessary for some test cases to >> produce some otherwise unwanted behavior (continuous refinement). >> >> CSR is pending. >> - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f > > src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 101: > >> 99: } >> 100: >> 101: void G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, DecoratorSet decorators, > > Have you measured the performance impact of inlining this assembly code instead of resorting to a runtime call as done before? Is it worth the maintenance cost (for every platform), risk of introducing bugs, etc.? I remember significant impact in some microbenchmark. It's also inlined in Parallel GC. I do not consider it a big issue wrt to maintenance - these things never really change, and the method is small and contained. I will try to redo numbers. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035298557 From tschatzl at openjdk.org Wed Apr 9 14:38:46 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 9 Apr 2025 14:38:46 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References:

Message-ID: On Tue, 8 Apr 2025 19:59:09 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: >> >> - * missing file from merge >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq >> - * make young gen length revising independent of refinement thread >> * use a service task >> * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update >> - * fix IR code generation tests that change due to barrier cost changes >> - * factor out card table and refinement table merging into a single >> method >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 >> - * obsolete G1UpdateBufferSize >> >> G1UpdateBufferSize has previously been used to size the refinement >> buffers and impose a minimum limit on the number of cards per thread >> that need to be pending before refinement starts. >> >> The former function is now obsolete with the removal of the dirty >> card queues, the latter functionality has been taken over by the new >> diagnostic option `G1PerThreadPendingCardThreshold`. >> >> I prefer to make this a diagnostic option is better than a product option >> because it is something that is only necessary for some test cases to >> produce some otherwise unwanted behavior (continuous refinement). >> >> CSR is pending. >> - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f > > src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 83: > >> 81: break; >> 82: } >> 83: case G1RemSet::HasRefToOld : break; // Nothing special to do. > > Why doesn't call `inc_cards_clean_again` in this case? The card is cleared also. (In fact, I don't get why this needs to a separate case from `NoInteresting`.) "NoInteresting" means that the card contains no interesting reference at all. "HasRefToOld" means that there has been an interesting reference in the card. The distinction between these groups of cards seems interesting to me. E.g. out of X non-clean cards, there were A with a reference to the collection set, B that were already marked as containing a card to the collection, C not having any interesting card any more (transitioned from clean -> dirty -> clean, and cleared by the mutator), D being non-parsable, and E having references to old (and no other references). I could add a separate counter for these type of cards too - they can be inferred from the total number of scanned minus the others though. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035512686 From duke at openjdk.org Wed Apr 9 17:12:47 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 9 Apr 2025 17:12:47 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v14] In-Reply-To: <-W1vBCTLtPyOZNm6XhHQXT9spBbkAd4Z4rTn_LHH1Aw=.5beae719-ac8b-404a-a34c-deecfc97dd7e@github.com> References: <394Wf5RpbwUgE7zBaZBnwa2YAxQFwWDhF1VuaMPHdhE=.98ff29f7-b6a7-49eb-bdd6-8489568b24b7@github.com> <-W1vBCTLtPyOZNm6XhHQXT9spBbkAd4Z4rTn_LHH1Aw=.5beae719-ac8b-404a-a34c-deecfc97dd7e@github.com> Message-ID: On Tue, 8 Apr 2025 21:58:57 GMT, Sandhya Viswanathan wrote: > Overall very clean and nicely done PR. Thanks a lot for considering my inputs. That is in no small part thanks to the reviewers, especially to Volodymyr! @lmesnik, @jatin-bhateja, @sviswa7 would one of you /sponsor me with the integration? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2790417248 From sviswanathan at openjdk.org Wed Apr 9 18:42:45 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 9 Apr 2025 18:42:45 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v14] In-Reply-To: References: <394Wf5RpbwUgE7zBaZBnwa2YAxQFwWDhF1VuaMPHdhE=.98ff29f7-b6a7-49eb-bdd6-8489568b24b7@github.com> <-W1vBCTLtPyOZNm6XhHQXT9spBbkAd4Z4rTn_LHH1Aw=.5beae719-ac8b-404a-a34c-deecfc97dd7e@github.com> Message-ID: On Wed, 9 Apr 2025 17:09:09 GMT, Ferenc Rakoczi wrote: >> Overall very clean and nicely done PR. Thanks a lot for considering my inputs. > >> Overall very clean and nicely done PR. Thanks a lot for considering my inputs. > > That is in no small part thanks to the reviewers, especially to Volodymyr! > @lmesnik, @jatin-bhateja, @sviswa7 would one of you /sponsor me with the integration? @ferakocz Once you do /integrate, I will be honored to sponsor your PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2790618572 From duke at openjdk.org Wed Apr 9 19:33:37 2025 From: duke at openjdk.org (duke) Date: Wed, 9 Apr 2025 19:33:37 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v14] In-Reply-To: <394Wf5RpbwUgE7zBaZBnwa2YAxQFwWDhF1VuaMPHdhE=.98ff29f7-b6a7-49eb-bdd6-8489568b24b7@github.com> References: <394Wf5RpbwUgE7zBaZBnwa2YAxQFwWDhF1VuaMPHdhE=.98ff29f7-b6a7-49eb-bdd6-8489568b24b7@github.com> Message-ID: On Tue, 8 Apr 2025 21:27:08 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Reacting to mor comments from Sandhya. @ferakocz Your change (at version 0b0d0969d6ac629bf2ca997d2286c4d28f91c1b9) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2790791121 From duke at openjdk.org Wed Apr 9 19:33:35 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 9 Apr 2025 19:33:35 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v14] In-Reply-To: References: <394Wf5RpbwUgE7zBaZBnwa2YAxQFwWDhF1VuaMPHdhE=.98ff29f7-b6a7-49eb-bdd6-8489568b24b7@github.com> <-W1vBCTLtPyOZNm6XhHQXT9spBbkAd4Z4rTn_LHH1Aw=.5beae719-ac8b-404a-a34c-deecfc97dd7e@github.com> Message-ID: On Wed, 9 Apr 2025 17:09:09 GMT, Ferenc Rakoczi wrote: >> Overall very clean and nicely done PR. Thanks a lot for considering my inputs. > >> Overall very clean and nicely done PR. Thanks a lot for considering my inputs. > > That is in no small part thanks to the reviewers, especially to Volodymyr! > @lmesnik, @jatin-bhateja, @sviswa7 would one of you /sponsor me with the integration? > @ferakocz Once you do /integrate, I will be honored to sponsor your PR. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2790788483 From duke at openjdk.org Wed Apr 9 21:18:35 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 9 Apr 2025 21:18:35 GMT Subject: Integrated: 8351034: Add AVX-512 intrinsics for ML-DSA In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 11:12:58 GMT, Ferenc Rakoczi wrote: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. This pull request has now been integrated. Changeset: e87ff328 Author: Ferenc Rakoczi Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/e87ff328d5cc66454213dee44cf2faeb0e76262f Stats: 1307 lines in 10 files changed: 1265 ins; 27 del; 15 mod 8351034: Add AVX-512 intrinsics for ML-DSA Reviewed-by: sviswanathan, lmesnik, vpaprotski, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/23860 From mdoerr at openjdk.org Wed Apr 9 22:26:31 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 9 Apr 2025 22:26:31 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References:

Message-ID: On Fri, 4 Apr 2025 08:10:34 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: > > - * missing file from merge > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq > - * make young gen length revising independent of refinement thread > * use a service task > * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update > - * fix IR code generation tests that change due to barrier cost changes > - * factor out card table and refinement table merging into a single > method > - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 > - * obsolete G1UpdateBufferSize > > G1UpdateBufferSize has previously been used to size the refinement > buffers and impose a minimum limit on the number of cards per thread > that need to be pending before refinement starts. > > The former function is now obsolete with the removal of the dirty > card queues, the latter functionality has been taken over by the new > diagnostic option `G1PerThreadPendingCardThreshold`. > > I prefer to make this a diagnostic option is better than a product option > because it is something that is only necessary for some test cases to > produce some otherwise unwanted behavior (continuous refinement). > > CSR is pending. > - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f This PR needs an update for x86 platforms when merging: g1BarrierSetAssembler_x86.cpp:117:6: error: 'class MacroAssembler' has no member named 'get_thread' ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2791114662 From tschatzl at openjdk.org Thu Apr 10 07:26:28 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 10 Apr 2025 07:26:28 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v31] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: - * fixes after merge related to 32 bit x86 removal - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * ayang review: revising young gen length * robcasloz review: various minor refactorings - Do not unnecessarily pass around tmp2 in x86 - Refine needs_liveness_data - Reorder includes - * missing file from merge - Merge branch 'master' into 8342382-card-table-instead-of-dcq - Merge branch 'master' into 8342382-card-table-instead-of-dcq - Merge branch 'master' into 8342382-card-table-instead-of-dcq - ... and 35 more: https://git.openjdk.org/jdk/compare/45b7c748...39aa903f ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=30 Stats: 7118 lines in 110 files changed: 2586 ins; 3598 del; 934 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Thu Apr 10 07:28:31 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 10 Apr 2025 07:28:31 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References:

Message-ID: On Wed, 9 Apr 2025 22:24:10 GMT, Martin Doerr wrote: > This PR needs an update for x86 platforms when merging: g1BarrierSetAssembler_x86.cpp:117:6: error: 'class MacroAssembler' has no member named 'get_thread' I fixed this for now, but it will be broken again in just a bit with Aleksey's ongoing removal of x86 32 bit platform efforts. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2791807489 From shade at openjdk.org Thu Apr 10 08:36:33 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Apr 2025 08:36:33 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References:

Message-ID: <03K6ui5yP3iy8HS_C4nurnsrbOymrm_962YA0-U92IM=.0f83b0ac-5895-4e1a-bb22-0006bd5dd888@github.com> On Thu, 10 Apr 2025 07:25:47 GMT, Thomas Schatzl wrote: > I fixed this for now, but it will be broken again in just a bit with Aleksey's ongoing removal of x86 32 bit platform efforts. I think all x86 cleanups related to GC and adjacent code have landed in mainline now. So I expect no more major conflicts with this PR :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2791985351 From tschatzl at openjdk.org Thu Apr 10 09:07:39 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 10 Apr 2025 09:07:39 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v32] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits: - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * fixes after merge related to 32 bit x86 removal - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * ayang review: revising young gen length * robcasloz review: various minor refactorings - Do not unnecessarily pass around tmp2 in x86 - Refine needs_liveness_data - Reorder includes - * missing file from merge - Merge branch 'master' into 8342382-card-table-instead-of-dcq - Merge branch 'master' into 8342382-card-table-instead-of-dcq - ... and 36 more: https://git.openjdk.org/jdk/compare/f94a4f7a...fcf96a2a ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=31 Stats: 7112 lines in 110 files changed: 2592 ins; 3594 del; 926 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From ayang at openjdk.org Thu Apr 10 09:12:32 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Thu, 10 Apr 2025 09:12:32 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References:

Message-ID: On Wed, 9 Apr 2025 14:32:43 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 83: >> >>> 81: break; >>> 82: } >>> 83: case G1RemSet::HasRefToOld : break; // Nothing special to do. >> >> Why doesn't call `inc_cards_clean_again` in this case? The card is cleared also. (In fact, I don't get why this needs to a separate case from `NoInteresting`.) > > "NoInteresting" means that the card contains no interesting reference at all. "HasRefToOld" means that there has been an interesting reference in the card. > > The distinction between these groups of cards seems interesting to me. E.g. out of X non-clean cards, there were A with a reference to the collection set, B that were already marked as containing a card to the collection, C not having any interesting card any more (transitioned from clean -> dirty -> clean, and cleared by the mutator), D being non-parsable, and E having references to old (and no other references). > > I could add a separate counter for these type of cards too - they can be inferred from the total number of scanned minus the others though. I see; "clean again" means the existing interesting pointer was overwritten by mutator. I misinterpret the comment as cards transitioned from dirty to clean. ` size_t _cards_clean_again; // Dirtied cards that were cleaned.` To prevent misunderstanding, what do you think of renaming "NoInteresting" to "NoCrossRegion" and "_cards_clean_again" to "_cards_no_cross_region", or sth alike so that the 1:1 mapping is clearer? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2036885633 From tschatzl at openjdk.org Thu Apr 10 10:02:40 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 10 Apr 2025 10:02:40 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v33] In-Reply-To: References: Message-ID: <5FzYDFpFOksmAGM5RV0gGk2eDAdinlDCGo8_37eUeEA=.5f96c37e-7b10-41b4-a607-fc7a665abd67@github.com> > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: - * indentation fix - * remove support for 32 bit x86 in the barrier generation code, following latest changes from @shade ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/fcf96a2a..068d2a37 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=32 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=31-32 Stats: 5 lines in 1 file changed: 0 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Thu Apr 10 10:02:41 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 10 Apr 2025 10:02:41 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: <03K6ui5yP3iy8HS_C4nurnsrbOymrm_962YA0-U92IM=.0f83b0ac-5895-4e1a-bb22-0006bd5dd888@github.com> References:

<03K6ui5yP3iy8HS_C4nurnsrbOymrm_962YA0-U92IM=.0f83b0ac-5895-4e1a-bb22-0006bd5dd888@github.com> Message-ID: On Thu, 10 Apr 2025 08:34:00 GMT, Aleksey Shipilev wrote: > > I fixed this for now, but it will be broken again in just a bit with Aleksey's ongoing removal of x86 32 bit platform efforts. > > I think all x86 cleanups related to GC and adjacent code have landed in mainline now. So I expect no more major conflicts with this PR :) Thanks. :) @TheRealMDoerr: should be fixed now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2792213039 From tschatzl at openjdk.org Thu Apr 10 11:01:42 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 10 Apr 2025 11:01:42 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References:

<8noWoU1cd2y4EjjK3QZGMLacPC9gkrwn5Ns3XbQbppI=.74de0b05-b8da-417f-8096-de98d7a3d815@github.com> Message-ID: On Wed, 9 Apr 2025 12:48:10 GMT, Thomas Schatzl wrote: >> src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 101: >> >>> 99: } >>> 100: >>> 101: void G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, DecoratorSet decorators, >> >> Have you measured the performance impact of inlining this assembly code instead of resorting to a runtime call as done before? Is it worth the maintenance cost (for every platform), risk of introducing bugs, etc.? > > I remember significant impact in some microbenchmark. It's also inlined in Parallel GC. I do not consider it a big issue wrt to maintenance - these things never really change, and the method is small and contained. > I will try to redo numbers. >From our microbenchmarks (higher numbers are better): Current code: Benchmark (size) Mode Cnt Score Error Units ArrayCopyObject.conjoint_micro 31 thrpt 15 166136.959 ? 5517.157 ops/ms ArrayCopyObject.conjoint_micro 63 thrpt 15 108880.108 ? 4331.112 ops/ms ArrayCopyObject.conjoint_micro 127 thrpt 15 93159.977 ? 5025.458 ops/ms ArrayCopyObject.conjoint_micro 2047 thrpt 15 17234.842 ? 831.344 ops/ms ArrayCopyObject.conjoint_micro 4095 thrpt 15 9202.216 ? 292.612 ops/ms ArrayCopyObject.conjoint_micro 8191 thrpt 15 3565.705 ? 121.116 ops/ms ArrayCopyObject.disjoint_micro 31 thrpt 15 159106.245 ? 5965.576 ops/ms ArrayCopyObject.disjoint_micro 63 thrpt 15 95475.658 ? 5415.267 ops/ms ArrayCopyObject.disjoint_micro 127 thrpt 15 84249.979 ? 6313.007 ops/ms ArrayCopyObject.disjoint_micro 2047 thrpt 15 10682.650 ? 381.832 ops/ms ArrayCopyObject.disjoint_micro 4095 thrpt 15 4471.940 ? 216.439 ops/ms ArrayCopyObject.disjoint_micro 8191 thrpt 15 1378.296 ? 33.421 ops/ms ArrayCopy.arrayCopyObject N/A avgt 15 13.880 ? 0.517 ns/op ArrayCopy.arrayCopyObjectNonConst N/A avgt 15 14.844 ? 0.751 ns/op ArrayCopy.arrayCopyObjectSameArraysBackward N/A avgt 15 11.080 ? 0.703 ns/op ArrayCopy.arrayCopyObjectSameArraysForward N/A avgt 15 11.003 ? 0.135 ns/op Runtime call: Benchmark (size) Mode Cnt Score Error Units ArrayCopyObject.conjoint_micro 31 thrpt 15 73100.230 ? 11079.381 ops/ms ArrayCopyObject.conjoint_micro 63 thrpt 15 65039.431 ? 1996.832 ops/ms ArrayCopyObject.conjoint_micro 127 thrpt 15 58336.711 ? 2260.660 ops/ms ArrayCopyObject.conjoint_micro 2047 thrpt 15 17035.419 ? 524.445 ops/ms ArrayCopyObject.conjoint_micro 4095 thrpt 15 9207.661 ? 286.526 ops/ms ArrayCopyObject.conjoint_micro 8191 thrpt 15 3264.491 ? 73.848 ops/ms ArrayCopyObject.disjoint_micro 31 thrpt 15 84587.219 ? 3007.310 ops/ms ArrayCopyObject.disjoint_micro 63 thrpt 15 62815.254 ? 1214.310 ops/ms ArrayCopyObject.disjoint_micro 127 thrpt 15 58423.470 ? 285.670 ops/ms ArrayCopyObject.disjoint_micro 2047 thrpt 15 10720.462 ? 617.173 ops/ms ArrayCopyObject.disjoint_micro 4095 thrpt 15 4178.195 ? 178.942 ops/ms ArrayCopyObject.disjoint_micro 8191 thrpt 15 1374.268 ? 44.290 ops/ms ArrayCopy.arrayCopyObject N/A avgt 15 19.667 ? 0.740 ns/op ArrayCopy.arrayCopyObjectNonConst N/A avgt 15 21.243 ? 1.891 ns/op ArrayCopy.arrayCopyObjectSameArraysBackward N/A avgt 15 16.645 ? 0.504 ns/op ArrayCopy.arrayCopyObjectSameArraysForward N/A avgt 15 17.409 ? 0.705 ns/op Obviously with larger arrays, the impact diminishes, but it's always there. I think the inlined code is worth the effort in this case. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2037086410 From rcastanedalo at openjdk.org Thu Apr 10 11:22:36 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 10 Apr 2025 11:22:36 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30] In-Reply-To: References:

<8noWoU1cd2y4EjjK3QZGMLacPC9gkrwn5Ns3XbQbppI=.74de0b05-b8da-417f-8096-de98d7a3d815@github.com>

Message-ID: On Thu, 10 Apr 2025 10:58:24 GMT, Thomas Schatzl wrote: >> I remember significant impact in some microbenchmark. It's also inlined in Parallel GC. I do not consider it a big issue wrt to maintenance - these things never really change, and the method is small and contained. >> I will try to redo numbers. > > From our microbenchmarks (higher numbers are better): > > Current code: > > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 15 166136.959 ? 5517.157 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 15 108880.108 ? 4331.112 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 15 93159.977 ? 5025.458 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 15 17234.842 ? 831.344 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 15 9202.216 ? 292.612 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 15 3565.705 ? 121.116 ops/ms > ArrayCopyObject.disjoint_micro 31 thrpt 15 159106.245 ? 5965.576 ops/ms > ArrayCopyObject.disjoint_micro 63 thrpt 15 95475.658 ? 5415.267 ops/ms > ArrayCopyObject.disjoint_micro 127 thrpt 15 84249.979 ? 6313.007 ops/ms > ArrayCopyObject.disjoint_micro 2047 thrpt 15 10682.650 ? 381.832 ops/ms > ArrayCopyObject.disjoint_micro 4095 thrpt 15 4471.940 ? 216.439 ops/ms > ArrayCopyObject.disjoint_micro 8191 thrpt 15 1378.296 ? 33.421 ops/ms > ArrayCopy.arrayCopyObject N/A avgt 15 13.880 ? 0.517 ns/op > ArrayCopy.arrayCopyObjectNonConst N/A avgt 15 14.844 ? 0.751 ns/op > ArrayCopy.arrayCopyObjectSameArraysBackward N/A avgt 15 11.080 ? 0.703 ns/op > ArrayCopy.arrayCopyObjectSameArraysForward N/A avgt 15 11.003 ? 0.135 ns/op > > Runtime call: > > Benchmark (size) Mode Cnt Score Error Units > ArrayCopyObject.conjoint_micro 31 thrpt 15 73100.230 ? 11079.381 ops/ms > ArrayCopyObject.conjoint_micro 63 thrpt 15 65039.431 ? 1996.832 ops/ms > ArrayCopyObject.conjoint_micro 127 thrpt 15 58336.711 ? 2260.660 ops/ms > ArrayCopyObject.conjoint_micro 2047 thrpt 15 17035.419 ? 524.445 ops/ms > ArrayCopyObject.conjoint_micro 4095 thrpt 15 9207.661 ? 286.526 ops/ms > ArrayCopyObject.conjoint_micro 8191 thrpt 15 3264.491 ? 73.848 ops/ms > ArrayCopyObject.disjoint_micro 31 thrpt 15 84587.219 ? 3007.310 ops/ms > ArrayCopyObject.disjoint_micro ... Fair enough, thanks for the measurements! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2037121277 From duke at openjdk.org Thu Apr 10 13:19:05 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 10 Apr 2025 13:19:05 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: - Code rearrange, some renaming, fixing comments - Changes suggested by Andrew Dinn. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23663/files - new: https://git.openjdk.org/jdk/pull/23663/files/8d5a9e12..74ff3d9f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=05-06 Stats: 2139 lines in 4 files changed: 630 ins; 826 del; 683 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From adinn at openjdk.org Thu Apr 10 14:19:46 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 10 Apr 2025 14:19:46 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7] In-Reply-To: References:

Message-ID: <1-Ncd4d8AXSun31DFTq_Q-GI0zbOW76QB0LXX2iyF38=.46a32e70-f613-43f0-a485-c5113200ed40@github.com> On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Code rearrange, some renaming, fixing comments > - Changes suggested by Andrew Dinn. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5300: > 5298: // level 5 > 5299: vs_ldpq(vq, kyberConsts); > 5300: int offsets4[4] = { 0, 32, 64, 96 }; Again a comment // At level 5 related coefficients occur in discrete blocks of size 8 so // need to be loaded interleaved using an ld2 operation with arrangement 2D src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5319: > 5317: vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4); > 5318: > 5319: // level 6 And again // At level 6 related coefficients occur in discrete blocks of size 4 so // need to be loaded interleaved using an ld2 operation with arrangement 4S src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5377: > 5375: // level 0 > 5376: vs_ldpq(vq, kyberConsts); > 5377: int offsets4[4] = { 0, 32, 64, 96 }; Again a comment // At level 0 related coefficients occur in discrete blocks of size 4 so // need to be loaded interleaved using an ld2 operation with arrangement 4S src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5399: > 5397: vs_st2_indexed(vs1, __ T4S, coeffs, tmpAddr, 384, offsets4); > 5398: > 5399: // level 1 Again a comment // At level 1 related coefficients occur in discrete blocks of size 8 so // need to be loaded interleaved using an ld2 operation with arrangement 2D src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5423: > 5421: > 5422: // level 2 > 5423: int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 }; Again // At level 2 coefficients occur in 8 discrete blocks of size 16 // so they are loaded using employing an ldr at 8 distinct offsets. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5464: > 5462: vs_str_indexed(vs1, __ Q, coeffs, 256, offsets3); > 5463: > 5464: // level 3 / From level 3 upwards coefficients occur in discrete blocks whose size is // some multiple of 32 so can be loaded using ldpq and suitable indexes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037571231 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037573218 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037577265 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037578385 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037581149 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037585101 From adinn at openjdk.org Thu Apr 10 14:52:33 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 10 Apr 2025 14:52:33 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7] In-Reply-To: References:

Message-ID: <8wo-YxVVmMSocSfA7Te4-txUqtv6JQyxktwIGZ_U3z4=.fadb528c-6d39-4733-be37-7c85387cf3da@github.com> On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Code rearrange, some renaming, fixing comments > - Changes suggested by Andrew Dinn. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5933: > 5931: vs_ld3_post(vin, __ T16B, condensed); > 5932: > 5933: // expand groups of input bytes in vin to shorts in va and vb I's like to expand on the data layouts here so that maintenance engineers don't have to work it out every time they look at it. So, I would like to replace this comment as follows // The front half of sequence vin (vin[0], vin[1] and vin[2]) // holds 48 (16x3) contiguous bytes from memory striped // horizontally across each of the 16 byte lanes. Equivalently, // that is 16 pairs of 12-bit integers. Likewise the back half // holds the next 48 bytes in the same arrangement. // Each vector in the front half can also be viewed as a vertical // strip across the 16 pairs of 12 bit integers. Each byte in // vin[0] stores the low 8 bits of the first int in a pair. Each // byte in vin[1] stores the high 4 bits of the first int and the // low 4 bits of the second int. Each byte in vin[2] stores the // high 8 bits of the second int. Likewise the vectors in second // half. // Converting the data to 16-bit shorts requires first of all // expanding each of the 6 x 16B vectors into 6 corresponding // pairs of 8H vectors. Mask, shift and add operations on the // resulting vector pairs can be used to combine 4 and 8 bit // parts of related 8H vector elements. // // The middle vectors (vin[2] and vin[5]) are actually expanded // twice, one copy manipulated to provide the lower 4 bits // belonging to the first short in a pair and another copy // manipulated to provide the higher 4 bits belonging to the // second short in a pair. This is why the the vector sequences va // and vb used to hold the expanded 8H elements are of length 8. // Expand vin[0] into va[0:1], and vin[1] into va[2:3] and va[4:5] src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5941: > 5939: __ ushll(va[4], __ T8H, vin[1], __ T8B, 0); > 5940: __ ushll2(va[5], __ T8H, vin[1], __ T16B, 0); > 5941: Insert here // Likewise expand vin[3] into vb[0:1], and vin[4] into vb[2:3] // and vb[4:5] src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5949: > 5947: __ ushll2(vb[5], __ T8H, vin[4], __ T16B, 0); > 5948: > 5949: // offset duplicated elements in va and vb by 8 To make this clearer it should say // shift lo byte of copy 1 of the middle stripe into the high byte src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5955: > 5953: __ shl(vb[3], __ T8H, vb[3], 8); > 5954: > 5955: // expand remaining input bytes in vin to shorts in va and vb To make this clearer it should say // Expand vin[2] into va[6:7] and vin[5] into vb[6:7] but this // time pre-shifted by 4 to ensure top bits of input 12-bit int // are in bit positions [4..11]. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5962: > 5960: __ ushll2(vb[7], __ T8H, vin[5], __ T16B, 4); > 5961: > 5962: // split the duplicated 8 bit values into two distinct 4 bit To make this clearer it should say // mask hi 4 bits of the 1st 12-bit int in a pair from copy1 and // shift lo 4 bits of the 2nd 12-bit int in a pair to the bottom of // copy2 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5973: > 5971: __ ushr(vb[5], __ T8H, vb[5], 4); > 5972: > 5973: // sum resulting short values into the front halves of va and This should be replaced to clarify details of the ordering for summing and grouping // sum hi 4 bits and lo 8 bits of the 1st 12-bit int in each pair and // hi 8 bits plus lo 4 bits of the 2nd 12-bit int in each pair // n.b. the ordering ensures: i) inputs are consumed before they // are overwritten ii) the order of 16-bit results across successive // pairs of vectors in va and then vb reflects the order of the // corresponding 12-bit inputs src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5984: > 5982: __ addv(vb[3], __ T8H, vb[5], vb[7]); > 5983: > 5984: // store results interleaved as shorts Change to // store 64 results interleaved as shorts src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5993: > 5991: __ cbz(parsedLength, L_end); > 5992: > 5993: // if anything is left it should be a final 72 bytes. so we Clarify as follows // if anything is left it should be a final 72 bytes of input // i.e. a final 48 12-bit values. so we handle this by loading // load 48 bytes into all 16B lanes of front(vin) and only 24 // bytes into the lower 8B lane of back(vin) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5999: > 5997: vs_ld3(vs_back(vin), __ T8B, condensed); > 5998: > 5999: // expand groups of input bytes in vin to shorts in va and vb Modify as above // Expand vin[0] into va[0:1], and vin[1] into va[2:3] and va[4:5] src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6009: > 6007: __ ushll2(va[5], __ T8H, vin[1], __ T16B, 0); > 6008: > 6009: __ ushll(vb[0], __ T8H, vin[3], __ T8B, 0); Add a comment // This time expand just the lower 8 lanes src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6013: > 6011: __ ushll(vb[4], __ T8H, vin[4], __ T8B, 0); > 6012: > 6013: // offset duplicated elements in va and vb by 8 As before clarify as follows // shift lo byte of copy 1 of the middle stripe into the high byte src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6018: > 6016: __ shl(vb[2], __ T8H, vb[2], 8); > 6017: > 6018: // expand remaining input bytes in vin to shorts in va and vb Again improve this comment // expand vin[2] into va[6:7] and lower 8 lanes of vin[5] into // vb[6] pre-shifted by 4 to ensure top bits of the input 12-bit // int are in bit positions [4..11]. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6024: > 6022: __ ushll(vb[6], __ T8H, vin[5], __ T8B, 4); > 6023: > 6024: // split the duplicated 8 bit values into two distinct 4 bit Once again update // mask hi 4 bits of each 1st 12-bit int in pair from copy1 and // shift lo 4 bits of each 2nd 12-bit int in pair to bottom of // copy2 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6033: > 6031: __ ushr(vb[4], __ T8H, vb[4], 4); > 6032: > 6033: // sum resulting short values into the front halves of va and Again update to provide more detail // sum hi 4 bits and lo 8 bits of each 1st 12-bit int in pair and // hi 8 bits plus lo 4 bits of each 2nd 12-bit int in pair // n.b. ordering ensures: i) inputs are consumed before they are // overwritten ii) order of 16-bit results across succsessive // pairs of vectors in va and then lower half of vb reflects order // of corresponding 12-bit inputs src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6042: > 6040: __ addv(vb[1], __ T8H, vb[4], vb[6]); > 6041: > 6042: // store results interleaved as shorts Change to // store 48 results interleaved as shorts ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037755555 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037758589 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037760493 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037762375 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037764723 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037767700 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037783970 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037771521 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037769694 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037774404 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037776668 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037779831 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037780704 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037781617 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037782757 From adinn at openjdk.org Thu Apr 10 16:53:32 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 10 Apr 2025 16:53:32 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7] In-Reply-To: References:

Message-ID: <5j4hyPmpvOzIValSb9YGR8fSrL14z-zJikzUHr5kSnA=.131af348-ddc2-4b2a-bca3-8a22f022a244@github.com> On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Code rearrange, some renaming, fixing comments > - Changes suggested by Andrew Dinn. @ferakocz Hi Ferenc. Thank you for adjusting the code as requested and even more so for the extra clean-ups you added which I very much appreciate. I have added suggestions for some extra/modified commenting to clarify certain details of what is being generated that were not 100% clear to me when I first read/restructured the code. They may seem a bit obvious but I want to ensure that any maintainer who needs to review the code can assimilate it quickly (including me if/when I revisit it in 12 months time). Mostly my recommendations for upgrading of comments is complete and I believe little more will be needed to sign off this PR. However, I still want to check through a few parts of the code that I have not fully cross-checked against the Java routines (e.g. the Barrett reductions). I'll try to do that asap but it will probably be a few days from now. Thanks again for your help in improving this code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23663#issuecomment-2794514677 From rcastanedalo at openjdk.org Fri Apr 11 13:01:49 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 11 Apr 2025 13:01:49 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v33] In-Reply-To: <5FzYDFpFOksmAGM5RV0gGk2eDAdinlDCGo8_37eUeEA=.5f96c37e-7b10-41b4-a607-fc7a665abd67@github.com> References: <5FzYDFpFOksmAGM5RV0gGk2eDAdinlDCGo8_37eUeEA=.5f96c37e-7b10-41b4-a607-fc7a665abd67@github.com> Message-ID: On Thu, 10 Apr 2025 10:02:40 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: > > - * indentation fix > - * remove support for 32 bit x86 in the barrier generation code, following latest changes from @shade Thank you for addressing my comments, Thomas! The new x64 version of `G1BarrierSetAssembler::gen_write_ref_array_post_barrier` looks correct, but I think it could be significantly simplified, here is my suggestion which is more similar to the aarch64 version: https://github.com/robcasloz/jdk/commit/fbedc0ae1ec5fcfa95b00ad354986885c7a56ce0 (note: did not test it thoroughly). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2796850628 From rcastanedalo at openjdk.org Fri Apr 11 13:10:33 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 11 Apr 2025 13:10:33 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v33] In-Reply-To: <5FzYDFpFOksmAGM5RV0gGk2eDAdinlDCGo8_37eUeEA=.5f96c37e-7b10-41b4-a607-fc7a665abd67@github.com> References: <5FzYDFpFOksmAGM5RV0gGk2eDAdinlDCGo8_37eUeEA=.5f96c37e-7b10-41b4-a607-fc7a665abd67@github.com> Message-ID: On Thu, 10 Apr 2025 10:02:40 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: > > - * indentation fix > - * remove support for 32 bit x86 in the barrier generation code, following latest changes from @shade > G1 sets UseCondCardMark to true by default. The conditional card mark corresponds to the third filter in the write barrier now, and since I decided to keep all filters for this change, it makes sense to directly use this mechanism. Do you have performance results for `-UseCondCardMark` vs. `+UseCondCardMark`? The benefit of `+UseCondCardMark` is not obvious from looking at the generated barrier code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2796872496 From rcastanedalo at openjdk.org Fri Apr 11 14:30:32 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 11 Apr 2025 14:30:32 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v33] In-Reply-To: <5FzYDFpFOksmAGM5RV0gGk2eDAdinlDCGo8_37eUeEA=.5f96c37e-7b10-41b4-a607-fc7a665abd67@github.com> References: <5FzYDFpFOksmAGM5RV0gGk2eDAdinlDCGo8_37eUeEA=.5f96c37e-7b10-41b4-a607-fc7a665abd67@github.com> Message-ID: On Thu, 10 Apr 2025 10:02:40 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: > > - * indentation fix > - * remove support for 32 bit x86 in the barrier generation code, following latest changes from @shade The compiler-related parts of this change (including x64 and aarch64 changes) look good! These are the files I reviewed: - `src/hotspot/share/gc/g1/g1BarrierSet*` - `src/hotspot/share/gc/g1/{c1,c2}` - `src/hotspot/cpu/{x86,aarch64}` - `test/hotspot/jtreg/compiler` - `test/hotspot/jtreg/testlibrary_tests` ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23739#pullrequestreview-2760546283 From kbarrett at openjdk.org Sat Apr 12 13:53:24 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Sat, 12 Apr 2025 13:53:24 GMT Subject: RFR: 8347719: [REDO] Portable implementation of FORBID_C_FUNCTION and ALLOW_C_FUNCTION Message-ID: Please review this second attempt. It's mostly similar to the original attempt: https://bugs.openjdk.org/browse/JDK-8313396 https://github.com/openjdk/jdk/pull/22890 but improves the workarounds for one clang issue, and adds a workaround for another clang issue https://bugs.openjdk.org/browse/JDK-8347649 See globalDefinitions_gcc.hpp for more details about those issues and the workarounds. Additions to the testing done for the earlier attempt (see below) mach5 tier4-5. There is an Oracle-internal build configuration in tier5 that failed with the earlier attempt. Local manual build and tier1 test on linux with clang. For testing on linux with clang, be aware of these issues: https://bugs.openjdk.org/browse/JDK-8354316 https://bugs.openjdk.org/browse/JDK-8354467 Below is a repeat of the PR summary for the earlier attempt. ---------- Please review this change to how HotSpot prevents the use of certain C library functions (e.g. poisons references to those functions), while permitting a subset to be used in restricted circumstances. Reasons for poisoning a function include it being considered obsolete, or a security concern, or there is a HotSpot function (typically in the os:: namespace) providing similar functionality that should be used instead. The old mechanism, based on -Wattribute-warning and the associated attribute, only worked for gcc. (Clang's implementation differs in an important way from gcc, which is the subject of a clang bug that has been open for years. MSVC doesn't provide a similar mechanism.) It also had problems with LTO, due to a gcc bug. The new mechanism is based on deprecation warnings, using [[deprecated]] attributes. We redeclare or forward declare the functions we want to prevent use of as being deprecated. This relies on deprecation warnings being enabled, which they already are in our build configuration. All of our supported compilers support the [[deprecated]] attribute. Another benefit of using deprecation warnings rather than warning attributes is the time when the check is performed. Warning attributes are checked only if the function is referenced after all optimizations have been performed. Deprecation is checked during initial semantic analysis. That's better for our purposes here. (This is also part of why gcc LTO has problems with the old mechanism, but not the new.) Adding these redeclarations or forward declarations isn't as simple as expected, due to differences between the various compilers. We hide the differences behind a set of macros, FORBID_C_FUNCTION and related macros. See the compiler-specific parts of those macros for details. In some situations we need to allow references to these poisoned functions. One common case is where our poisoning is visible to some 3rd party code we don't want to modify. This is typically 3rd party headers included in HotSpot code, such as from Google Test or the C++ Standard Library. For these the BEGIN/END_ALLOW_FORBIDDEN_FUNCTIONS pair of macros are used demark the context where such references are permitted. Some of the poisoned functions are needed to implement associated HotSpot os:: functions, or in other similarly restricted contexts. For these, a wrapper function is provided that calls the poisoned function with the warning suppressed. These wrappers are defined in the permit_forbidden_functions [note: for REDO this is changed to permit_forbidden_function, per prior review] namespace, and called using the qualified name. This makes the use of these functions stand out, suggesting they need careful scrutiny in code reviews and the like. There are several benefits to this approach vs the old ALLOW_C_FUNCTION macro. We can centralize the set of such functions. The syntax for use is simpler (there were syntactic bugs with the old mechanism that weren't always noticed for a while). The permitted reference is explicit; there can't be an ALLOW_C_FUNCTION use that isn't actually needed. Testing: mach5 tier1-3, which includes various build variants such as slowdebug. GHA sanity tests Manual testing for warnings for direct calls to poisoned functions with all 3 compilers, and that the error messages look sane and helpful. gcc: : In function 'void test_exit(int)': ::: error: 'void exit(int)' is deprecated: use os::exit [-Werror=deprecated-declarations] 32 | void test_exit(int status) { return exit(status); } | ~~~~^~~~~~~~ ... and more stuff about the declaration ... clang: ::: error: 'exit' is deprecated: use os::exit [-Werror,-Wdeprecated-declarations] void test_exit(int status) { return exit(status); } ^ ... and more stuff about the declaration ... Visual Studio: (): warning C4996: 'exit': use os::exit ------------- Commit messages: - improve/add clang-specific workarounds - apply original change Changes: https://git.openjdk.org/jdk/pull/24608/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24608&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8347719 Stats: 627 lines in 32 files changed: 464 ins; 64 del; 99 mod Patch: https://git.openjdk.org/jdk/pull/24608.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24608/head:pull/24608 PR: https://git.openjdk.org/jdk/pull/24608 From dnsimon at openjdk.org Mon Apr 14 12:56:20 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 14 Apr 2025 12:56:20 GMT Subject: RFR: 8352724: Verify bounds for primitive array reads in JVMCI In-Reply-To: References: Message-ID: On Mon, 24 Mar 2025 14:31:54 GMT, Andrej Pe?im?th wrote: > This PR adds a bounds check for primitive array reads in JVMCI. When a JVMCI compiler attempts to read after the last array element (from the padding of the allocated object), JVMCI should throw an exception instead of returning a garbage value. The check added in this PR handles both primitive and object reads. LGTM and trival. Actually, can you please add some extra tests to `TestConstantReflectionProvider.java` for out-of-bounds reads. ------------- Marked as reviewed by dnsimon (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24200#pullrequestreview-2764015041 Changes requested by dnsimon (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24200#pullrequestreview-2764143284 From duke at openjdk.org Mon Apr 14 12:56:15 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Mon, 14 Apr 2025 12:56:15 GMT Subject: RFR: 8352724: Verify bounds for primitive array reads in JVMCI Message-ID: This PR adds a bounds check for primitive array reads in JVMCI. When a JVMCI compiler attempts to read after the last array element (from the padding of the allocated object), JVMCI should throw an exception instead of returning a garbage value. The check added in this PR handles both primitive and object reads. ------------- Commit messages: - Verify bounds for primitive array reads. Changes: https://git.openjdk.org/jdk/pull/24200/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24200&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8352724 Stats: 13 lines in 1 file changed: 10 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24200/head:pull/24200 PR: https://git.openjdk.org/jdk/pull/24200 From duke at openjdk.org Mon Apr 14 12:57:18 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 14 Apr 2025 12:57:18 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v8] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Clarified comments as suggested by Andrew Dinn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23663/files - new: https://git.openjdk.org/jdk/pull/23663/files/74ff3d9f..d7f7fc8e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=06-07 Stats: 134 lines in 3 files changed: 96 ins; 4 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From adinn at openjdk.org Mon Apr 14 13:00:45 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Apr 2025 13:00:45 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7] In-Reply-To: References:

Message-ID: On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Code rearrange, some renaming, fixing comments > - Changes suggested by Andrew Dinn. src/hotspot/cpu/aarch64/register_aarch64.hpp line 509: > 507: } > 508: > 509: // convenience methods for splitting 8-way of 4-way vector register Suggestion: // convenience methods for splitting 8-way or 4-way vector register src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5012: > 5010: assert(!va.is_constant(), "output vector must identify 2 different registers"); > 5011: > 5012: // schedule 2 streams of i 5164: // > 5165: // On each level, we fill up the vector registers in such a way that the > 5166: // array elements that need to be multiplied by the zetas be in one Suggestion: // array elements that need to be multiplied by the zetas are in one src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5168: > 5166: // array elements that need to be multiplied by the zetas be in one > 5167: // set of vector registers while the corresponding ones that don't need to > 5168: // be multiplied, in another set. We can do 32 Montgomery multiplications Suggestion: // be multiplied are in another set. We can do 32 Montgomery multiplications src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5278: > 5276: // level 4 > 5277: vs_ldpq(vq, kyberConsts); > 5278: int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 }; I'd like to add comment here to explain the coefficient grouping and likewise at level 5 and 6. So here we have: Suggestion: // Up to level 3 the coefficients multiplied by or added/subtracted // to the zetas occur in discrete blocks whose size is some multiple // of 32. At level 4 coefficients occur in 8 discrete blocks of size 16 // so they are loaded using an ldr at 8 distinct offsets. int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 }; src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5299: > 5297: > 5298: // level 5 > 5299: vs_ldpq(vq, kyberConsts); Suggestion: vs_ldpq(vq, kyberConsts); // At level 5 related coefficients occur in discrete blocks of size 8 so // need to be loaded interleaved using an ld2 operation with arrangement 2D src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5319: > 5317: vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4); > 5318: > 5319: // level 6 Suggestion: // level 6 // At level 6 related coefficients occur in discrete blocks of size 4 so // need to be loaded interleaved using an ld2 operation with arrangement 4S src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5377: > 5375: // level 0 > 5376: vs_ldpq(vq, kyberConsts); > 5377: int offsets4[4] = { 0, 32, 64, 96 }; Again a comment Suggestion: // At level 6 related coefficients occur in discrete blocks of size 4 so // need to be loaded interleaved using an ld2 operation with arrangement 4S int offsets4[4] = { 0, 32, 64, 96 }; src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5399: > 5397: vs_st2_indexed(vs1, __ T4S, coeffs, tmpAddr, 384, offsets4); > 5398: > 5399: // level 1 Again a comment Suggestion: // level 1 // At level 1 related coefficients occur in discrete blocks of size 8 so // need to be loaded interleaved using an ld2 operation with arrangement 2D src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5422: > 5420: vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4); > 5421: > 5422: // level 2 Again Suggestion: // level 2 // At level 2 coefficients occur in 8 discrete blocks of size 16 // so they are loaded using employing an ldr at 8 distinct offsets. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5464: > 5462: vs_str_indexed(vs1, __ Q, coeffs, 256, offsets3); > 5463: > 5464: // level 3 Suggestion: // level 3 // From level 3 upwards coefficients occur in discrete blocks whose size is // some multiple of 32 so can be loaded using ldpq and suitable indexes. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5591: > 5589: store64shorts(vs2, tmpAddr); > 5590: > 5591: load64shorts(vs1, tmpAddr); I'd like to make explicit the fact that we have avoided doing an add here (and in the next two cases) by adding a commented out generation step i.e. at this line insert Suggestion: // __ add(tmpAddr, coeffs, 128); // unneeded as implied by preceding store load64shorts(vs1, tmpAddr); src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5596: > 5594: store64shorts(vs2, tmpAddr); > 5595: > 5596: load64shorts(vs1, tmpAddr); Likewise insert: Suggestion: // __ add(tmpAddr, coeffs, 256); // unneeded as implied by preceding store load64shorts(vs1, tmpAddr); src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5601: > 5599: store64shorts(vs2, tmpAddr); > 5600: > 5601: load64shorts(vs1, tmpAddr); Likewise insert: Suggestion: // __ add(tmpAddr, coeffs, 384); // unneeded as implied by preceding store load64shorts(vs1, tmpAddr); src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5640: > 5638: VSeq<4> vs1(0), vs2(4); // 4 sets of 8x8H inputs/outputs/tmps > 5639: VSeq<4> vs3(16), vs4(20); > 5640: VSeq<2> vq(30); // pair of constants for montmul Suggestion: VSeq<2> vq(30); // pair of constants for montmul: qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5642: > 5640: VSeq<2> vq(30); // pair of constants for montmul > 5641: VSeq<2> vz(28); // pair of zetas > 5642: VSeq<4> vc(27, 0); // constant sequence for montmul Suggestion: VSeq<4> vc(27, 0); // constant for montmul: montRSquareModQ src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6094: > 6092: VSeq<8> vc2_1(31, 0); > 6093: VSeq<2> vc2_2(31, 0); > 6094: FloatRegister vc2_3 = v31; Suggestion: // we also need a pair of corresponding constant sequences VSeq<8> vc1_1(30, 0); // kyber_q VSeq<2> vc1_2(30, 0); FloatRegister vc1_3 = v30; VSeq<8> vc2_1(31, 0); // kyberBarrettMultiplier VSeq<2> vc2_2(31, 0); FloatRegister vc2_3 = v31; src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6102: > 6100: // load q and the multiplier for the Barrett reduction > 6101: __ add(kyberConsts, kyberConsts, 16); > 6102: __ ldpq(vc1_3, vc2_3, kyberConsts); Suggestion: __ ldpq(vc1_3, vc2_3, kyberConsts); // kyber_q, kyberBarrettMultiplier src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6111: > 6109: __ ldr(vs1_3, __ Q, __ post(coeffs, 16)); > 6110: } > 6111: vs_sqdmulh(vs2_1, __ T8H, vs1_1, vc2_1); Suggestion: vs_sqdmulh(vs2_1, __ T8H, vs1_1, vc2_1); // vs2 <- (2 * vs1 * kyberBarrettMultiplier) >> 16 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6116: > 6114: __ sqdmulh(vs2_3, __ T8H, vs1_3, vc2_3); > 6115: } > 6116: vs_sshr(vs2_1, __ T8H, vs2_1, 11); Suggestion: vs_sshr(vs2_1, __ T8H, vs2_1, 11); // vs2 <- (vs1 * kyberBarrettMultiplier) >> 26 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6121: > 6119: __ sshr(vs2_3, __ T8H, vs2_3, 11); > 6120: } > 6121: vs_mlsv(vs1_1, __ T8H, vs2_1, vc1_1); Suggestion: vs_mlsv(vs1_1, __ T8H, vs2_1, vc1_1); // vs1 <- vs1 - vs2 * kyber_q ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041886663 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041888092 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041889242 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041889966 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041892243 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041892994 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041893924 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041895469 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041896626 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041898215 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041899623 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041902206 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041902833 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041906126 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041909711 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041929011 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041867539 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041868670 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041881429 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041883012 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041883686 From duke at openjdk.org Mon Apr 14 12:57:35 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 14 Apr 2025 12:57:35 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7] In-Reply-To: <5j4hyPmpvOzIValSb9YGR8fSrL14z-zJikzUHr5kSnA=.131af348-ddc2-4b2a-bca3-8a22f022a244@github.com> References:

<5j4hyPmpvOzIValSb9YGR8fSrL14z-zJikzUHr5kSnA=.131af348-ddc2-4b2a-bca3-8a22f022a244@github.com> Message-ID: <84xoy6ZxXhJmi5NyuTdJ_cF4QSuhS8e94-kXpvic1AQ=.6894bece-9f7d-4f95-b60e-2efce0b8f7ac@github.com> On Thu, 10 Apr 2025 16:50:29 GMT, Andrew Dinn wrote: >> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: >> >> - Code rearrange, some renaming, fixing comments >> - Changes suggested by Andrew Dinn. > > @ferakocz Hi Ferenc. Thank you for adjusting the code as requested and even more so for the extra clean-ups you added which I very much appreciate. > > I have added suggestions for some extra/modified commenting to clarify certain details of what is being generated that were not 100% clear to me when I first read/restructured the code. They may seem a bit obvious but I want to ensure that any maintainer who needs to review the code can assimilate it quickly (including me if/when I revisit it in 12 months time). > > Mostly my recommendations for upgrading of comments is complete and I believe little more will be needed to sign off this PR. However, I still want to check through a few parts of the code that I have not fully cross-checked against the Java routines (e.g. the Barrett reductions). I'll try to do that asap but it will probably be a few days from now. > > Thanks again for your help in improving this code. @adinn Hi, Andrew, I think I addressed all of your comment improvement comments, in most cases I just changed them as you suggested. Thanks a lot for the thorough review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23663#issuecomment-2801545565 From adinn at openjdk.org Mon Apr 14 13:14:19 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Apr 2025 13:14:19 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v7] In-Reply-To: References:

Message-ID: On Mon, 14 Apr 2025 14:32:52 GMT, Andrej Pe?im?th wrote: >> This PR adds a bounds check for primitive array reads in JVMCI. When a JVMCI compiler attempts to read after the last array element (from the padding of the allocated object), JVMCI should throw an exception instead of returning a garbage value. The check added in this PR handles both primitive and object reads. > > Andrej Pe?im?th has updated the pull request incrementally with one additional commit since the last revision: > > Test reads after last array element in JVMCI. test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.runtime.test/src/jdk/vm/ci/runtime/test/TestConstantReflectionProvider.java line 148: > 146: if (cv.boxed != null && cv.boxed.getClass().isArray()) { > 147: JavaKind kind = metaAccess.lookupJavaType(cv.value).getComponentType().getJavaKind(); > 148: long offset = metaAccess.getArrayBaseOffset(kind) + (long) metaAccess.getArrayIndexScale(kind) * Array.getLength(cv.boxed); If I understand correctly, this tests a read of an element one past the end of the array. Can you please also add a test for a read that is partially out-of-bounds: long offset = 1 + metaAccess.getArrayBaseOffset(kind) + (long) metaAccess.getArrayIndexScale(kind) * (Array.getLength(cv.boxed) - 1); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24200#discussion_r2042298538 From duke at openjdk.org Mon Apr 14 15:34:21 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 14 Apr 2025 15:34:21 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v9] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Uncommented an accidentakky commented line in ML_KEM.java + changed 1 more comment as suggested by Andrew Dinn. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23663/files - new: https://git.openjdk.org/jdk/pull/23663/files/d7f7fc8e..5901547f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=07-08 Stats: 3 lines in 2 files changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From duke at openjdk.org Mon Apr 14 16:33:07 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Mon, 14 Apr 2025 16:33:07 GMT Subject: RFR: 8352724: Verify bounds for primitive array reads in JVMCI [v3] In-Reply-To: References: Message-ID: > This PR adds a bounds check for primitive array reads in JVMCI. When a JVMCI compiler attempts to read after the last array element (from the padding of the allocated object), JVMCI should throw an exception instead of returning a garbage value. The check added in this PR handles both primitive and object reads. Andrej Pe?im?th has updated the pull request incrementally with one additional commit since the last revision: Test array reads that are partially out of bounds. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24200/files - new: https://git.openjdk.org/jdk/pull/24200/files/7475f468..3661b212 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24200&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24200&range=01-02 Stats: 17 lines in 1 file changed: 16 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/24200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24200/head:pull/24200 PR: https://git.openjdk.org/jdk/pull/24200 From duke at openjdk.org Mon Apr 14 16:37:43 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Mon, 14 Apr 2025 16:37:43 GMT Subject: RFR: 8352724: Verify bounds for primitive array reads in JVMCI [v2] In-Reply-To: References:

Message-ID: <4p7s9KQLUxBWfHroigJ58xbtbGVwyMEVUOX0lEjvgGk=.24971263-8748-4d6e-ac6f-805dd1612be1@github.com> On Mon, 14 Apr 2025 14:39:35 GMT, Doug Simon wrote: >> Andrej Pe?im?th has updated the pull request incrementally with one additional commit since the last revision: >> >> Test reads after last array element in JVMCI. > > test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.runtime.test/src/jdk/vm/ci/runtime/test/TestConstantReflectionProvider.java line 148: > >> 146: if (cv.boxed != null && cv.boxed.getClass().isArray()) { >> 147: JavaKind kind = metaAccess.lookupJavaType(cv.value).getComponentType().getJavaKind(); >> 148: long offset = metaAccess.getArrayBaseOffset(kind) + (long) metaAccess.getArrayIndexScale(kind) * Array.getLength(cv.boxed); > > If I understand correctly, this tests a read of an element one past the end of the array. > Can you please also add a test for a read that is partially out-of-bounds: > > long offset = 1 + metaAccess.getArrayBaseOffset(kind) + (long) metaAccess.getArrayIndexScale(kind) * (Array.getLength(cv.boxed) - 1); I added a test for a `long` read from `array[array.index - 1]` because adding `+ 1` would make the read unaligned (which is also not allowed). Please check it out. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24200#discussion_r2042506762 From dnsimon at openjdk.org Mon Apr 14 16:41:44 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 14 Apr 2025 16:41:44 GMT Subject: RFR: 8352724: Verify bounds for primitive array reads in JVMCI [v3] In-Reply-To: References:

Message-ID: <6dfjF8s-i_CdERHJjGXUHGaMhI_LKgxFxULtPzc3Ufc=.30e51f39-1d5e-4499-968c-aca28094d220@github.com> On Mon, 14 Apr 2025 16:33:07 GMT, Andrej Pe?im?th wrote: >> This PR adds a bounds check for primitive array reads in JVMCI. When a JVMCI compiler attempts to read after the last array element (from the padding of the allocated object), JVMCI should throw an exception instead of returning a garbage value. The check added in this PR handles both primitive and object reads. > > Andrej Pe?im?th has updated the pull request incrementally with one additional commit since the last revision: > > Test array reads that are partially out of bounds. Thanks for the new tests. ------------- Marked as reviewed by dnsimon (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24200#pullrequestreview-2764946450 From duke at openjdk.org Mon Apr 14 16:47:43 2025 From: duke at openjdk.org (duke) Date: Mon, 14 Apr 2025 16:47:43 GMT Subject: RFR: 8352724: Verify bounds for primitive array reads in JVMCI [v3] In-Reply-To: