From tschatzl at openjdk.org Mon Mar 3 08:42:05 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 08:42:05 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * fix comment (trailing whitespace) * another assert when snapshotting at a safepoint. 
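For readers skimming the thread, a rough idea of what the reduction amounts to is sketched below. This is a standalone illustration only, not the code in this change: the helper names, the card/region constants and the exact set of filters that survive are assumptions based on the description above.

```c++
// Hypothetical sketch (not the PR's code): a post-write barrier that only
// filters and conditionally dirties a card -- no StoreLoad fence and no
// dirty card queue enqueue remain on the fast path.
#include <cstdint>

static const int     kCardShift   = 9;    // 512-byte cards, illustrative
static const int     kRegionShift = 22;   // 4M regions, illustrative
static const uint8_t kCleanCard   = 0xff; // illustrative card values
static const uint8_t kDirtyCard   = 0x00;

static uint8_t* card_table_base = nullptr; // assumed to be set up elsewhere in this model

static inline void post_write_barrier(void* field_addr, void* new_val) {
  uintptr_t f = reinterpret_cast<uintptr_t>(field_addr);
  uintptr_t v = reinterpret_cast<uintptr_t>(new_val);
  if (((f ^ v) >> kRegionShift) == 0) return;  // same region check
  if (new_val == nullptr) return;              // null value check
  uint8_t* card = card_table_base + (f >> kCardShift);
  if (*card != kCleanCard) return;             // conditional card mark
  *card = kDirtyCard;                          // single store; no fence, no enqueue
}
```

Compared with the pseudo code above, everything from the StoreLoad onwards collapses into at most one conditional store, which is roughly what brings the inlined fast path into the neighbourhood of the three-or-four instruction barriers of Serial and Parallel GC.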
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/d87935a0..810bf2d3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=03-04 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Mon Mar 3 11:18:32 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Mar 2025 11:18:32 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA Message-ID: By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. ------------- Commit messages: - JDK-8351034 Add AVX-512 intrinsics for ML-DSA Changes: https://git.openjdk.org/jdk/pull/23860/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351034 Stats: 2530 lines in 18 files changed: 2445 ins; 9 del; 76 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From amitkumar at openjdk.org Mon Mar 3 14:25:54 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 3 Mar 2025 14:25:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix comment (trailing whitespace) > * another assert when snapshotting at a safepoint. I don't see any failure on s390x. Tier1 test looks good. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2694563382 From ayang at openjdk.org Mon Mar 3 15:22:10 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Mon, 3 Mar 2025 15:22:10 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix comment (trailing whitespace) > * another assert when snapshotting at a safepoint. src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 106: > 104: > 105: __ testptr(count, count); > 106: __ jcc(Assembler::equal, done); I wonder if we can use "zero" instead of "equal" here; they have the same underlying value, but the semantic is to checking for "zero". src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 133: > 131: Label is_clean_card; > 132: __ cmpb(Address(addr, 0), G1CardTable::clean_card_val()); > 133: __ jcc(Assembler::equal, is_clean_card); Should this checking be guarded by `if (UseCondCardMark)`? I see that aarch64 does that. src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 143: > 141: > 142: __ bind(is_clean_card); > 143: // Card was not clean. Dirty card and go to next.. Why "not clean"? I thought this path is for dirtying clean card? src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 323: > 321: assert(thread == r15_thread, "must be"); > 322: #endif // _LP64 > 323: assert_different_registers(store_addr, new_val, thread, tmp1 /*, tmp2 unused */, noreg); Seems that `tmp2` is unused in this method. It is used in aarch64, but it's not obvious to me whether that is indeed necessary. If so, can you add a comment saying sth like "this unused var is needed for other archs..."? src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 54: > 52: // result = 0xBBAABBAA > 53: inline size_t blend(size_t a, size_t b, size_t mask) { > 54: return a ^ ((a ^ b) & mask); The example makes it much clearer; I wonder if `return (a & ~mask) | (b & mask);` is more readable. src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 59: > 57: > 58: void G1CardTableClaimTable::reset_all_claims_to_claimed() { > 59: for (size_t i = 0; i < _max_reserved_regions; i++) { `uint` for `i`? src/hotspot/share/gc/g1/g1CardTableClaimTable.hpp line 64: > 62: void reset_all_claims_to_unclaimed(); > 63: void reset_all_claims_to_claimed(); > 64: I wonder if these two APIs can be renamed to "reset_all_to_x", which is more aligned with its single-region counterpart, `reset_to_unclaimed`, IMO. 
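As a side note on the `blend` helper a few comments up: the two candidate formulations are bit-for-bit equivalent, which a small standalone check (plain C++, not part of the patch) makes easy to verify. The concrete inputs below are an assumption chosen to reproduce the `0xBBAABBAA` result shown in the snipped source comment.

```c++
// Quick standalone equivalence check for the two blend() forms discussed above;
// not part of the patch. Inputs are assumed/illustrative.
#include <cassert>
#include <cstddef>
#include <cstdio>

// Take bits from 'b' where 'mask' is set, from 'a' where it is clear.
static size_t blend_xor(size_t a, size_t b, size_t mask) {
  return a ^ ((a ^ b) & mask);            // form used in the patch
}

static size_t blend_and_or(size_t a, size_t b, size_t mask) {
  return (a & ~mask) | (b & mask);        // suggested, arguably more readable form
}

int main() {
  assert(blend_xor(0xAAAAAAAA, 0xBBBBBBBB, 0xFF00FF00) == 0xBBAABBAA);
  // Exhaustive check over byte-sized operands as a sanity test of equivalence.
  for (size_t a = 0; a < 256; a++)
    for (size_t b = 0; b < 256; b++)
      for (size_t m = 0; m < 256; m++)
        assert(blend_xor(a, b, m) == blend_and_or(a, b, m));
  puts("both blend() forms agree");
  return 0;
}
```

Whether a given compiler emits the same instructions for both is then purely a readability question.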
src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 348: > 346: void G1ConcurrentRefineWorkState::snapshot_heap_into(G1CardTableClaimTable* sweep_table) { > 347: // G1CollectedHeap::heap_region_iterate() below will only visit committed regions. Initialize > 348: // all entries in the state table here to not require special handling when iterating over it. Can you elaborate on what the "special handling" would be, if we don's set "claimed" for non-committed regions? src/hotspot/share/gc/g1/g1RemSet.cpp line 837: > 835: for (; refinement_cur_card < refinement_end_card; ++refinement_cur_card, ++card_cur_word) { > 836: size_t value = *refinement_cur_card; > 837: *refinement_cur_card = G1CardTable::WordAllClean; Similarly, this is a "word", not "card", also. src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 857: > 855: // We do not expect too many non-Java threads compared to Java threads, so just > 856: // let one worker claim that work. > 857: if (!_non_java_threads_claim && !Atomic::cmpxchg(&_non_java_threads_claim, false, true, memory_order_relaxed)) { Do non-java threads have card-table-base? src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 862: > 860: > 861: class ResizeAndSwapCardTableClosure : public ThreadClosure { > 862: SwapCardTableClosure _cl; Field indentation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977586579 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977594184 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977583002 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977601907 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977645576 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977571306 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977573354 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977704351 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977575441 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977701293 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977679688 From dnsimon at openjdk.org Mon Mar 3 15:32:18 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 3 Mar 2025 15:32:18 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields Message-ID: The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. 
------------- Commit messages: - made order of ciInstanceKlass::_nonstatic_fields same as JavaFieldStream (and Class.getDeclaredFields) - made order of ResolvedJavaType.getInstanceFields match Class.getDeclaredFields Changes: https://git.openjdk.org/jdk/pull/23849/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23849&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350892 Stats: 89 lines in 6 files changed: 18 ins; 32 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/23849.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23849/head:pull/23849 PR: https://git.openjdk.org/jdk/pull/23849 From tschatzl at openjdk.org Mon Mar 3 15:40:04 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 15:40:04 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 14:11:09 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 143: > >> 141: >> 142: __ bind(is_clean_card); >> 143: // Card was not clean. Dirty card and go to next.. > > Why "not clean"? I thought this path is for dirtying clean card? My interpretation is: in this path the card has been found clean ("is clean") earlier. So dirty it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977733993 From tschatzl at openjdk.org Mon Mar 3 15:42:57 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 15:42:57 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 14:47:00 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 54: > >> 52: // result = 0xBBAABBAA >> 53: inline size_t blend(size_t a, size_t b, size_t mask) { >> 54: return a ^ ((a ^ b) & mask); > > The example makes it much clearer; I wonder if `return (a & ~mask) | (b & mask);` is more readable. ... and hope that the optimizer knows this pattern? If you insist I can do that, brief examination of that code snippet by itself (not within this code) showed that it does. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977739888 From mdoerr at openjdk.org Mon Mar 3 16:31:58 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 3 Mar 2025 16:31:58 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> Message-ID: <413JPgs-IIREKFfH05GHeskZzg5lpyBuNbW6jeGyQVk=.35277f99-0552-4e06-92a0-17d051979e1a@github.com> On Fri, 28 Feb 2025 10:47:39 GMT, Martin Doerr wrote: > > I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, But it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. > > The PPC64 code looks correct and some quick tests have passed. I'll run larger test suites over the weekend. 
Test results look good (including tier 1-4 on many platforms). I didn't see any new issue related to this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2694935731 From tschatzl at openjdk.org Mon Mar 3 16:55:55 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 16:55:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 15:17:27 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 857: > >> 855: // We do not expect too many non-Java threads compared to Java threads, so just >> 856: // let one worker claim that work. >> 857: if (!_non_java_threads_claim && !Atomic::cmpxchg(&_non_java_threads_claim, false, true, memory_order_relaxed)) { > > Do non-java threads have card-table-base? This code should not be necessary (any more). Will remove. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977853483 From tschatzl at openjdk.org Mon Mar 3 18:22:24 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 18:22:24 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v6] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. 
> > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: ayang review 2 * removal of useless code * renamings ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/810bf2d3..b3dd0084 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=04-05 Stats: 51 lines in 7 files changed: 16 ins; 10 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Mon Mar 3 19:00:59 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Mar 2025 19:00:59 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v2] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Added comments, removed debugging printfs ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/1ff58512..fe50e0d8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=00-01 Stats: 12 lines in 2 files changed: 9 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From iwalulya at openjdk.org Mon Mar 3 20:18:58 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Mon, 3 Mar 2025 20:18:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. 
Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix comment (trailing whitespace) > * another assert when snapshotting at a safepoint. src/hotspot/share/gc/g1/g1CardTable.cpp line 44: > 42: if (!failures) { > 43: G1CollectedHeap* g1h = G1CollectedHeap::heap(); > 44: G1HeapRegion* r = g1h->heap_region_containing(mr.start()); Probably we can move this outside the loop, and assert that `mr` does not cross region boundaries src/hotspot/share/gc/g1/g1CollectedHeap.hpp line 916: > 914: void safepoint_synchronize_end() override; > 915: > 916: jlong synchronized_duration() const { return _safepoint_duration; } safepoint_duration() seems easier to comprehend. src/hotspot/share/gc/g1/g1CollectionSet.cpp line 310: > 308: verify_young_cset_indices(); > 309: > 310: size_t card_rs_length = _policy->analytics()->predict_card_rs_length(in_young_only_phase); Why are we using a prediction here? Additionally, won't this prediction also include cards from the old gen regions in case of mixed gcs? How do we reconcile that when we are adding old gen regions to c-set? src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 42: > 40: class G1HeapRegion; > 41: class G1Policy; > 42: class G1CardTableClaimTable; Nit: ordering of the declarations src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 84: > 82: // Tracks the current refinement state from idle to completion (and reset back > 83: // to idle). > 84: class G1ConcurrentRefineWorkState { G1ConcurrentRefinementState? I am not convinced the "Work" adds any clarity src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 113: > 111: // Current epoch the work has been started; used to determine if there has been > 112: // a forced card table swap due to a garbage collection while doing work. 
> 113: size_t _refine_work_epoch; same as previous comment, why `refine_work` instead of `refinement`? src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 43: > 41: size_t _cards_clean; // Number of cards found clean. > 42: size_t _cards_not_parsable; // Number of cards we could not parse and left unrefined. > 43: size_t _cards_still_refer_to_cset; // Number of cards marked still young. `_cards_still_refer_to_cset` from the naming it is not clear what the difference is with `_cards_refer_to_cset`, the comment is not helping with that ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977688778 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977969470 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977982999 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977991124 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978017843 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978019093 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978119476 From pchilanomate at openjdk.org Mon Mar 3 23:42:56 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Mon, 3 Mar 2025 23:42:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. 
While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. Changes look good to me. Just a few comments. Thanks, Patricio src/hotspot/share/runtime/objectMonitor.cpp line 204: > 202: // If the thread (F) that removes itself from the end of the list > 203: // hasn't got any prev pointer, we just set the tail pointer to > 204: // null, see 5) and 6) below. Setting the tail pointer to null would be for the case when this node is also the head, i.e single element. Otherwise we just rebuild the doubly link list, unlink F, and set entry_list_tail to G. In other words, the comment here and below seems to be missing that we have to build the doubly link list when F acquires the monitor, not when F needs to find a successor. src/hotspot/share/runtime/objectMonitor.cpp line 1265: > 1263: // that updated _entry_list, so we can access w->_next. > 1264: w = Atomic::load_acquire(&_entry_list); > 1265: assert(w != nullptr, "invariant"); Maybe add the same assert as below for the single element case: `assert(w->TState == ObjectWaiter::TS_ENTER, "invariant")`. src/hotspot/share/runtime/objectMonitor.cpp line 1359: > 1357: // Build the doubly linked list to get hold of currentNode->prev(). > 1358: _entry_list_tail = nullptr; > 1359: entry_list_tail(current); I think we should try to avoid having to rebuild the doubly link list from scratch, since only a few nodes in the front might be missing the previous links. For platform threads it might not matter that much, but for virtual threads this list could be much larger. Maybe we can leave it as a future enhancement. src/hotspot/share/runtime/objectMonitor.cpp line 1509: > 1507: // is no successor, so it appears that an heir-presumptive > 1508: // (successor) must be made ready. Only the current lock owner can > 1509: // detach threads from the entry_list, therefore we need to We don't detach threads here, so maybe manipulate would be better. src/hotspot/share/runtime/objectMonitor.cpp line 1532: > 1530: // Let's say T1 then stalls. T2 acquires O and calls O.notify(). The > 1531: // notify() operation moves T1 from O's waitset to O's entry_list. T2 then > 1532: // release the lock "O". T2 resumes immediately after the ST of null into Pre-existent, but this should be T1. Same in next sentence. 
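As a reading aid for the list manipulation under review, here is a simplified standalone model of the push-to-head / lazily-built-prev scheme described in the PR text. It is an illustration of the idea only, not the actual ObjectMonitor code: it uses std::atomic instead of HotSpot's Atomic:: wrappers, a plain struct instead of ObjectWaiter, and it omits waiter states, removal/unlinking and all the races the real code handles.

```c++
// Simplified standalone model of the entry_list scheme; not ObjectMonitor code.
#include <atomic>

struct Waiter {
  Waiter* next = nullptr;   // set once when pushed (singly-linked via next)
  Waiter* prev = nullptr;   // filled in lazily while walking towards the tail
};

static std::atomic<Waiter*> entry_list{nullptr};
static Waiter*              entry_list_tail = nullptr;  // only touched by the lock owner

// Contending threads push themselves onto the head with a CAS loop.
static void push(Waiter* w) {
  Waiter* head = entry_list.load(std::memory_order_relaxed);
  do {
    w->next = head;
    w->prev = nullptr;
  } while (!entry_list.compare_exchange_weak(head, w,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
}

// The exiting owner picks the successor in FIFO order, i.e. the tail. Walking
// from the head fills in the prev pointers (building the doubly-linked part),
// and the tail is cached so later exits do not have to walk again.
static Waiter* find_successor() {
  if (entry_list_tail != nullptr) return entry_list_tail;
  Waiter* w = entry_list.load(std::memory_order_acquire);
  if (w == nullptr) return nullptr;            // nobody is waiting
  while (w->next != nullptr) {
    w->next->prev = w;                         // interior of the list is stable
    w = w->next;
  }
  entry_list_tail = w;
  return w;
}
```

The model deliberately leaves out the parts being discussed in the comments above (what happens when a thread removes itself from the tail, and when the prev pointers get rebuilt), so it is only meant as a map of the terminology.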
------------- PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2655551088 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978372164 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978368315 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978374081 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978369547 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978370888 From dholmes at openjdk.org Tue Mar 4 04:52:57 2025 From: dholmes at openjdk.org (David Holmes) Date: Tue, 4 Mar 2025 04:52:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Mon, 3 Mar 2025 23:15:46 GMT, Patricio Chilano Mateo wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 204: > >> 202: // If the thread (F) that removes itself from the end of the list >> 203: // hasn't got any prev pointer, we just set the tail pointer to >> 204: // null, see 5) and 6) below. > > Setting the tail pointer to null would be for the case when this node is also the head, i.e single element. Otherwise we just rebuild the doubly link list, unlink F, and set entry_list_tail to G. In other words, the comment here and below seems to be missing that we have to build the doubly link list when F acquires the monitor, not when F needs to find a successor. We don't rebuild at this point. The thread that is removing itself just sets tail to null if there is no prev. Later when F exits the monitor it will construct the DLL to find the next successor. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978608595 From tschatzl at openjdk.org Tue Mar 4 08:24:54 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:24:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 20:02:16 GMT, Ivan Walulya wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 43: > >> 41: size_t _cards_clean; // Number of cards found clean. >> 42: size_t _cards_not_parsable; // Number of cards we could not parse and left unrefined. >> 43: size_t _cards_still_refer_to_cset; // Number of cards marked still young. > > `_cards_still_refer_to_cset` from the naming it is not clear what the difference is with `_cards_refer_to_cset`, the comment is not helping with that `cards_still_refer_to_cset` refers to cards that were found to have already been marked as `to-collection-set`. Renamed to `_cards_already_refer_to_cset`, would that be okay? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978868225 From tschatzl at openjdk.org Tue Mar 4 08:28:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:28:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 18:28:48 GMT, Ivan Walulya wrote: > Why are we using a prediction here? Quickly checking again, do we have the actual count here from somewhere? > Additionally, won't this prediction also include cards from the old gen regions in case of mixed gcs? How do we reconcile that when we are adding old gen regions to c-set? The predictor contents changed to (supposedly) only contain cards containing young gen references. See g1Policy.cpp:934: _analytics->report_card_rs_length(total_cards_scanned - total_non_young_rs_cards, is_young_only_pause); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978876199 From tschatzl at openjdk.org Tue Mar 4 08:36:55 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:36:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 15:19:20 GMT, Albert Mingkun Yang wrote: > Can you elaborate on what the "special handling" would be, if we don's set "claimed" for non-committed regions? the iteration code, would for every region check whether the region is actually committed or not. The `heap_region_iterate()` API of `G1CollectedHeap` only iterates over committed regions. So only committed regions will be updated in the state table. Later when iterating over the state table, the code uses the array directly, i.e. the claim state of uncommitted regions would be read as uninitialized. Further, it would be hard to exclude regions committed after the snapshot otherwise (we do not need to iterate over them. Their card table can't contain card marks) as we do not track newly committed regions in the snapshot. We could do, but would be a headache due to memory synchronization because regions can be committed any time. Imho it is much simpler to reset all the card claims to "already processed" and then make the regions we want to work on claimable. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978893134 From tschatzl at openjdk.org Tue Mar 4 08:39:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:39:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 08:22:03 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 43: >> >>> 41: size_t _cards_clean; // Number of cards found clean. >>> 42: size_t _cards_not_parsable; // Number of cards we could not parse and left unrefined. >>> 43: size_t _cards_still_refer_to_cset; // Number of cards marked still young. >> >> `_cards_still_refer_to_cset` from the naming it is not clear what the difference is with `_cards_refer_to_cset`, the comment is not helping with that > > `cards_still_refer_to_cset` refers to cards that were found to have already been marked as `to-collection-set`. Renamed to `_cards_already_refer_to_cset`, would that be okay? Fwiw, this is just for statistics, so if you want I can remove these. 
I did some experiments with re-examining these cards too to see whether we could clear them later. For determining if/when to do that a rate of increase for the young cards has been interesting. As mentioned, if you want I can remove them. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978896272 From tschatzl at openjdk.org Tue Mar 4 08:53:46 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:53:46 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v7] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... 
Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * iwalulya initial comments * renaming * made blend() helper function more clear; at least gcc will optimize it to the same code as before ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/b3dd0084..8f46dc9a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=05-06 Stats: 27 lines in 9 files changed: 7 ins; 3 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Tue Mar 4 09:15:24 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 09:15:24 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v8] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. 
Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * do not change card table base for gc threads during swapping * not necessary because they do not use it * (recent assert that verifies that non-java threads do not have a card table found this) ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/8f46dc9a..9e2ee543 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=06-07 Stats: 25 lines in 1 file changed: 9 ins; 14 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From dnsimon at openjdk.org Tue Mar 4 09:23:08 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Mar 2025 09:23:08 GMT Subject: RFR: 8351036: [JVMCI] value not an s2: -32776 Message-ID: This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. ------------- Commit messages: - support stack slots with an offset > Short.MAX_VALUE Changes: https://git.openjdk.org/jdk/pull/23888/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23888&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351036 Stats: 44 lines in 4 files changed: 36 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23888.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23888/head:pull/23888 PR: https://git.openjdk.org/jdk/pull/23888 From yzheng at openjdk.org Tue Mar 4 09:30:53 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 4 Mar 2025 09:30:53 GMT Subject: RFR: 8351036: [JVMCI] value not an s2: -32776 In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 09:18:40 GMT, Doug Simon wrote: > This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. Marked as reviewed by yzheng (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23888#pullrequestreview-2656634837 From iwalulya at openjdk.org Tue Mar 4 09:38:58 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Tue, 4 Mar 2025 09:38:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 08:36:58 GMT, Thomas Schatzl wrote: >> `cards_still_refer_to_cset` refers to cards that were found to have already been marked as `to-collection-set`. Renamed to `_cards_already_refer_to_cset`, would that be okay? > > Fwiw, this particular counter is just for statistics, so if you want I can remove these. I did some experiments with re-examining these cards too to see whether we could clear them later. For determining if/when to do that a rate of increase for the young cards has been interesting. > > As mentioned, if you want I can remove them. 
`_cards_already_refer_to_cset` is fine by me, i don't like the option of removing them ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979009507 From iwalulya at openjdk.org Tue Mar 4 09:43:54 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Tue, 4 Mar 2025 09:43:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 08:26:10 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1CollectionSet.cpp line 310: >> >>> 308: verify_young_cset_indices(); >>> 309: >>> 310: size_t card_rs_length = _policy->analytics()->predict_card_rs_length(in_young_only_phase); >> >> Why are we using a prediction here? Additionally, won't this prediction also include cards from the old gen regions in case of mixed gcs? How do we reconcile that when we are adding old gen regions to c-set? > >> Why are we using a prediction here? > > Quickly checking again, do we have the actual count here from somewhere? > >> Additionally, won't this prediction also include cards from the old gen regions in case of mixed gcs? How do we reconcile that when we are adding old gen regions to c-set? > > The predictor contents changed to (supposedly) only contain cards containing young gen references. See g1Policy.cpp:934: > > _analytics->report_card_rs_length(total_cards_scanned - total_non_young_rs_cards, is_young_only_pause); Fair, I missed that details on young RS have been removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979022900 From tschatzl at openjdk.org Tue Mar 4 09:57:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 09:57:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * iwalulya review 2 * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState * some additional documentation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/9e2ee543..442d9eae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=07-08 Stats: 93 lines in 7 files changed: 27 ins; 3 del; 63 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Tue Mar 4 09:57:58 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 09:57:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: <3BAl6ELdTMEhWoovthkw7lq86mwuoUnyKxzCANFnwNc=.41077bf4-8073-4810-9d0d-078d7ad06240@github.com> On Tue, 4 Mar 2025 09:52:40 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 84: >> >>> 82: // Tracks the current refinement state from idle to completion (and reset back >>> 83: // to idle). >>> 84: class G1ConcurrentRefineWorkState { >> >> G1ConcurrentRefinementState? I am not convinced the "Work" adds any clarity > > We agreed on `G1ConcurrentRefineSweepState` for now, better suggestions welcome. > > Use `Refine` instead of `Refinement` since all pre-existing classes also use `Refine`. This could be renamed in an extra change. Add the `Sweep` in the name because this is not the state for entire refinement (which also includes information about when to start refinement/sweeping). 
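To make the sweep terminology a little more concrete, here is one way such a per-cycle state object could be modelled. Everything below is an assumption made for illustration; the enum values, method names and epoch handling are not the ones in the patch.

```c++
// Illustrative sketch only; not the patch's G1ConcurrentRefineSweepState.
#include <atomic>
#include <cstddef>

enum class SweepState {
  Idle,           // mutators dirty the primary card table, nothing to sweep
  SwappedTables,  // card tables switched, heap snapshot taken
  Sweeping,       // refinement threads walk the now-inactive table
  Completed       // swept table cleaned, ready to go back to Idle
};

class RefineSweepCycle {
  SweepState          _state = SweepState::Idle;  // only the control logic changes this
  std::atomic<size_t> _swap_epoch{0};             // bumped on every swap, incl. forced ones

public:
  size_t start_sweep() {
    _state = SweepState::SwappedTables;
    size_t epoch = _swap_epoch.fetch_add(1, std::memory_order_acq_rel) + 1;
    _state = SweepState::Sweeping;
    return epoch;    // workers remember which snapshot they are sweeping
  }

  // A garbage collection that has to switch tables itself bumps the epoch.
  void forced_swap_at_gc() {
    _swap_epoch.fetch_add(1, std::memory_order_acq_rel);
    _state = SweepState::Idle;
  }

  // If another swap happened in the meantime, the snapshot being swept is
  // stale and the worker should stop instead of finishing it.
  bool is_interrupted(size_t my_epoch) const {
    return _swap_epoch.load(std::memory_order_acquire) != my_epoch;
  }

  void complete_sweep() { _state = SweepState::Completed; }
  void make_idle()      { _state = SweepState::Idle; }
};
```

The epoch check mirrors what the quoted `_refine_work_epoch` comment describes: detecting a card table swap forced by a garbage collection while sweep work is still in flight.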
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979053344 From tschatzl at openjdk.org Tue Mar 4 09:57:58 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 09:57:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 18:50:37 GMT, Ivan Walulya wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 84: > >> 82: // Tracks the current refinement state from idle to completion (and reset back >> 83: // to idle). >> 84: class G1ConcurrentRefineWorkState { > > G1ConcurrentRefinementState? I am not convinced the "Work" adds any clarity We agreed on `G1ConcurrentRefineSweepState` for now, better suggestions welcome. Use `Refine` instead of `Refinement` since all pre-existing classes also use `Refine`. This could be renamed in an extra change. > src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 113: > >> 111: // Current epoch the work has been started; used to determine if there has been >> 112: // a forced card table swap due to a garbage collection while doing work. >> 113: size_t _refine_work_epoch; > > same as previous comment, why `refine_work` instead of `refinement`? Already renamed, same as previous comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979050867 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979051649 From mdoerr at openjdk.org Tue Mar 4 10:40:55 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 4 Mar 2025 10:40:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 09:57:56 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * iwalulya review 2 > * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState > * some additional documentation I got an error while testing java/foreign/TestUpcallStress.java on linuxaarch64 with this PR: # Internal Error (/openjdk-jdk-linux_aarch64-dbg/jdk/src/hotspot/share/gc/g1/g1CardTable.cpp:56), pid=19044, tid=19159 # guarantee(!failures) failed: there should not have been any failures ... V [libjvm.so+0xb6e988] G1CardTable::verify_region(MemRegion, unsigned char, bool)+0x3b8 (g1CardTable.cpp:56) V [libjvm.so+0xc3a10c] G1MergeHeapRootsTask::G1ClearBitmapClosure::do_heap_region(G1HeapRegion*)+0x13c (g1RemSet.cpp:1048) V [libjvm.so+0xb7a80c] G1CollectedHeap::par_iterate_regions_array(G1HeapRegionClosure*, G1HeapRegionClaimer*, unsigned int const*, unsigned long, unsigned int) const+0x9c (g1CollectedHeap.cpp:2059) V [libjvm.so+0xc49fe8] G1MergeHeapRootsTask::work(unsigned int)+0x708 (g1RemSet.cpp:1225) V [libjvm.so+0x19597bc] WorkerThread::run()+0x98 (workerThread.cpp:69) V [libjvm.so+0x1824510] Thread::call_run()+0xac (thread.cpp:231) V [libjvm.so+0x13b3994] thread_native_entry(Thread*)+0x130 (os_linux.cpp:877) C [libpthread.so.0+0x875c] start_thread+0x18c ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2697024679 From tschatzl at openjdk.org Tue Mar 4 10:48:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 10:48:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 10:37:47 GMT, Martin Doerr wrote: > I got an error while testing java/foreign/TestUpcallStress.java on linuxaarch64 with this PR: > > ``` > # Internal Error (/openjdk-jdk-linux_aarch64-dbg/jdk/src/hotspot/share/gc/g1/g1CardTable.cpp:56), pid=19044, tid=19159 > # guarantee(!failures) failed: there should not have been any failures > ... 
> V [libjvm.so+0xb6e988] G1CardTable::verify_region(MemRegion, unsigned char, bool)+0x3b8 (g1CardTable.cpp:56) > V [libjvm.so+0xc3a10c] G1MergeHeapRootsTask::G1ClearBitmapClosure::do_heap_region(G1HeapRegion*)+0x13c (g1RemSet.cpp:1048) > V [libjvm.so+0xb7a80c] G1CollectedHeap::par_iterate_regions_array(G1HeapRegionClosure*, G1HeapRegionClaimer*, unsigned int const*, unsigned long, unsigned int) const+0x9c (g1CollectedHeap.cpp:2059) > V [libjvm.so+0xc49fe8] G1MergeHeapRootsTask::work(unsigned int)+0x708 (g1RemSet.cpp:1225) > V [libjvm.so+0x19597bc] WorkerThread::run()+0x98 (workerThread.cpp:69) > V [libjvm.so+0x1824510] Thread::call_run()+0xac (thread.cpp:231) > V [libjvm.so+0x13b3994] thread_native_entry(Thread*)+0x130 (os_linux.cpp:877) > C [libpthread.so.0+0x875c] start_thread+0x18c > ``` I will try to reproduce. Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2697052899 From tschatzl at openjdk.org Tue Mar 4 10:53:46 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 10:53:46 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v10] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). 
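To make the instruction-count comparison above concrete, here is a minimal stand-alone C++ sketch of the two barrier shapes, assuming a plain one-byte-per-512-byte-card table. This is a toy model with invented names and encodings, not the barrier code the compilers actually emit.

```cpp
// Toy model contrasting the two post-write barrier shapes discussed above.
// Illustrative only: encodings, sizes and helper functions are invented here.
#include <atomic>
#include <cstddef>
#include <cstdint>

using CardValue = uint8_t;

static const int       card_shift = 9;     // 512-byte cards, as in HotSpot
static const CardValue dirty_card = 0;     // placeholder encodings
static const CardValue clean_card = 0xff;

static char      heap[1 << 20];                        // 1 MiB toy "heap"
static CardValue card_table[(1 << 20) >> card_shift];  // one byte per card

static CardValue* card_for(void* addr) {
  uintptr_t offset = (uintptr_t)addr - (uintptr_t)heap;
  return &card_table[offset >> card_shift];
}

// Parallel/Serial-style barrier: a single unconditional store. This is roughly
// the shape the change described here moves the G1 post-write barrier towards.
static void post_write_barrier_simple(void* field) {
  *card_for(field) = dirty_card;
}

// Stand-ins for the filters and the enqueue used by the current G1 barrier.
static bool same_region(void*, void*) { return false; }  // "same region check"
static bool in_young_gen(CardValue*)  { return false; }  // "write to young gen check"
static void enqueue_card(CardValue*)  {}                 // dirty card queue hand-off

// Shape of the current G1 post-write barrier from the pseudo code above.
static void post_write_barrier_g1_current(void* field, void* new_value) {
  if (same_region(field, new_value)) return;
  if (new_value == nullptr) return;
  CardValue* card = card_for(field);
  if (in_young_gen(card)) return;
  std::atomic_thread_fence(std::memory_order_seq_cst);   // StoreLoad
  if (*card == dirty_card) return;
  *card = dirty_card;
  enqueue_card(card);                                     // card tracking
}

int main() {
  for (size_t i = 0; i < sizeof(card_table); i++) card_table[i] = clean_card;
  post_write_barrier_simple(&heap[4096]);
  post_write_barrier_g1_current(&heap[8192], &heap[0]);
  return 0;
}
```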
> > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * ayang review - fix comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/442d9eae..fc674f02 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=08-09 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Tue Mar 4 11:14:01 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Mar 2025 11:14:01 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 09:53:21 GMT, Andrew Dinn wrote: > Oops. sorry - cut and paste error -- the new setting should be > > ``` > do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) > ``` @adinn, I have done this change, but that erased your approval. Could you reapprove? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697145316 From iwalulya at openjdk.org Tue Mar 4 11:19:59 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Tue, 4 Mar 2025 11:19:59 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 09:57:56 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * iwalulya review 2 > * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState > * some additional documentation src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 108: > 106: > 107: void G1ConcurrentRefineThreadControl::control_thread_do(ThreadClosure* tc) { > 108: if (_control_thread != nullptr) { maybe maintain using `if (max_num_threads() > 0)` as used in `G1ConcurrentRefineThreadControl::initialize`, so that it is clear that setting `G1ConcRefinementThreads=0` effectively turns off concurrent refinement. src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 354: > 352: if (!r->is_free()) { > 353: // Need to scan all parts of non-free regions, so reset the claim. > 354: // No need for synchronization: we are only interested about regions s/about/in src/hotspot/share/gc/g1/g1OopClosures.hpp line 205: > 203: G1CollectedHeap* _g1h; > 204: uint _worker_id; > 205: bool _has_to_cset_ref; Similar to `_cards_refer_to_cset` , do you mind renaming `_has_to_cset_ref` and `_has_to_old_ref` to `_has_ref_to_cset` and `_has_ref_to_old` src/hotspot/share/gc/g1/g1Policy.hpp line 105: > 103: uint _free_regions_at_end_of_collection; > 104: > 105: size_t _pending_cards_from_gc; A comment on the variable would be nice, especially on how it is set/reset both at end of GC and by refinement. 
Also the `_to_collection_set_cards` below could use a comment ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979077904 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979102189 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979212854 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979155941 From adinn at openjdk.org Tue Mar 4 11:21:00 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 4 Mar 2025 11:21:00 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 06:22:09 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Merged master. > - Added more comments, mainly as suggested by Andrew Dinn > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions > - Adding comments + some code reorganization > - removed debugging code > - merging master > - ... and 3 more: https://git.openjdk.org/jdk/compare/ab4b0ef9...d82dfb2f Still good. ------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23300#pullrequestreview-2657047714 From tschatzl at openjdk.org Tue Mar 4 11:39:55 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 11:39:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 10:06:37 GMT, Ivan Walulya wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * iwalulya review 2 >> * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState >> * some additional documentation > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 108: > >> 106: >> 107: void G1ConcurrentRefineThreadControl::control_thread_do(ThreadClosure* tc) { >> 108: if (_control_thread != nullptr) { > > maybe maintain using `if (max_num_threads() > 0)` as used in `G1ConcurrentRefineThreadControl::initialize`, so that it is clear that setting `G1ConcRefinementThreads=0` effectively turns off concurrent refinement. I added a new `is_refinement_enabled()` predicate instead (that uses `max_num_threads()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979252156 From shade at openjdk.org Tue Mar 4 11:51:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 4 Mar 2025 11:51:07 GMT Subject: RFR: 8345169: Implement JEP XXX: Remove the 32-bit x86 Port In-Reply-To: References: Message-ID: On Thu, 5 Dec 2024 08:26:10 GMT, Aleksey Shipilev wrote: > **NOTE: This is work-in-progress draft for interested parties. The JEP is not even submitted, let alone targeted.** > > My plan is to to get this done in a quiet time in mainline to limit the ongoing conflicts with mainline. Feel free to comment in this PR, if you see something ahead of time. These comments might adjust the trajectory we take to implement this removal and/or allows us submit and work out more RFEs ahead of this removal. 
I plan to re-open a clean PR after this preliminary PR is done, maybe after the round of preliminary reviews. > > This removes the 32-bit x86 port and does a deeper cleaning in Hotspot. The following paragraphs describe what and why was being done. > > Easy stuff first: all files named `*_x86_32` are gone. Those are only built when build system knows we are compiling for x86_32. There is therefore no impact on x86_64. > > The code under `!LP64`, `!AMD64` and `IA32` is removed in `x86`-specific files. There is quite a bit of the code, especially around `Assembler` and `MacroAssembler`. I think these removals make the whole thing cleaner. The downside is that some of the `MacroAssembler::*ptr` functions that were used to select the "machine pointer" instructions either from x86_64 or x86_32 are now exclusively for x86_64. I don't think we want to rewrite `*ptr` -> `*q` at this point. I think we gradually morph the code base to use `*q`-flavored methods in new code. > > x86_32 is the only platform that has special cases for x87 FPU. > > C1 even implements the whole separate thing to deal with x87 FPU: the parts of regalloc treat it specially, there is `FpuStackSim`, there is `VerifyFPU` family of flags, etc. There are also peculiarities with FP conversions that use FPU, that's why x86_32 used to have template interpreter stubs for FP conversion methods. None of that is needed anymore without x86_32. This cleans up some arch-specific code as well. > > Both C1 and C2 implement the workarounds for non-IEEE compliant rounding of x87 FPU. After x86_32 is gone, these are not needed anymore. This removes some C2 nodes, removes the rounding instructions in C1. > > x86_64 is baselined on SSE2+, the VM would not even start if SSE2 is not supported. Most of the checks that we have for `UseSSE < 2` are for the benefit of x86_32. Because of this I folded redundant `UseSSE` checks around Hotspot. > > The one thing I _deliberately_ avoided doing is merging `x86.ad` and `x86_64.ad`. It would likely introduce uncomfortable amount of conflicts with pending work in mainli... Great, thanks for the feedback. I think we are going to go with the JEP implementation that removes the easy parts of x86_32 code, and then do the deeper cleanups under [JDK-8351148](https://bugs.openjdk.org/browse/JDK-8351148) umbrella. I added some subtasks there, based on the commits from this bulk PR. I am closing this PR in favor of about-to-be-created cleaner PR for JEP 503. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22567#issuecomment-2697266596 From shade at openjdk.org Tue Mar 4 11:51:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 4 Mar 2025 11:51:07 GMT Subject: Withdrawn: 8345169: Implement JEP XXX: Remove the 32-bit x86 Port In-Reply-To: References: Message-ID: On Thu, 5 Dec 2024 08:26:10 GMT, Aleksey Shipilev wrote: > **NOTE: This is work-in-progress draft for interested parties. The JEP is not even submitted, let alone targeted.** > > My plan is to to get this done in a quiet time in mainline to limit the ongoing conflicts with mainline. Feel free to comment in this PR, if you see something ahead of time. These comments might adjust the trajectory we take to implement this removal and/or allows us submit and work out more RFEs ahead of this removal. I plan to re-open a clean PR after this preliminary PR is done, maybe after the round of preliminary reviews. > > This removes the 32-bit x86 port and does a deeper cleaning in Hotspot. The following paragraphs describe what and why was being done. 
> > Easy stuff first: all files named `*_x86_32` are gone. Those are only built when build system knows we are compiling for x86_32. There is therefore no impact on x86_64. > > The code under `!LP64`, `!AMD64` and `IA32` is removed in `x86`-specific files. There is quite a bit of the code, especially around `Assembler` and `MacroAssembler`. I think these removals make the whole thing cleaner. The downside is that some of the `MacroAssembler::*ptr` functions that were used to select the "machine pointer" instructions either from x86_64 or x86_32 are now exclusively for x86_64. I don't think we want to rewrite `*ptr` -> `*q` at this point. I think we gradually morph the code base to use `*q`-flavored methods in new code. > > x86_32 is the only platform that has special cases for x87 FPU. > > C1 even implements the whole separate thing to deal with x87 FPU: the parts of regalloc treat it specially, there is `FpuStackSim`, there is `VerifyFPU` family of flags, etc. There are also peculiarities with FP conversions that use FPU, that's why x86_32 used to have template interpreter stubs for FP conversion methods. None of that is needed anymore without x86_32. This cleans up some arch-specific code as well. > > Both C1 and C2 implement the workarounds for non-IEEE compliant rounding of x87 FPU. After x86_32 is gone, these are not needed anymore. This removes some C2 nodes, removes the rounding instructions in C1. > > x86_64 is baselined on SSE2+, the VM would not even start if SSE2 is not supported. Most of the checks that we have for `UseSSE < 2` are for the benefit of x86_32. Because of this I folded redundant `UseSSE` checks around Hotspot. > > The one thing I _deliberately_ avoided doing is merging `x86.ad` and `x86_64.ad`. It would likely introduce uncomfortable amount of conflicts with pending work in mainli... This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/22567 From tschatzl at openjdk.org Tue Mar 4 11:56:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 11:56:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
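The queue-based tracking in the bullets above is what this change replaces with two card tables that are switched atomically (see the "main idea" paragraph further down in this description). Below is a toy model of that coarse-grained hand-over, with invented names and with the thread synchronization around the switch deliberately omitted.

```cpp
// Toy model of the coarse-grained scheme: mutators dirty whichever card table
// is currently "primary", refinement retires that table and sweeps it.
// Illustrative only; names, types and the switch protocol are invented.
#include <atomic>
#include <cstddef>
#include <cstdint>

using CardValue = uint8_t;
static const size_t    num_cards  = 2048;
static const CardValue clean_card = 0xff;
static const CardValue dirty_card = 0;

static CardValue table_a[num_cards];
static CardValue table_b[num_cards];

// Mutators only ever load this pointer and store into the table it refers to.
static std::atomic<CardValue*> primary_table{table_a};

// Post-write barrier: dirty the card in the current primary table.
static void post_write_barrier(size_t card_index) {
  primary_table.load(std::memory_order_relaxed)[card_index] = dirty_card;
}

// Refinement control: retire the current primary table and make the other
// (already cleaned) table the new primary one. The retired table can then be
// swept without fine-grained synchronization with the mutators.
static CardValue* switch_card_tables() {
  CardValue* retired = primary_table.load(std::memory_order_relaxed);
  CardValue* fresh = (retired == table_a) ? table_b : table_a;
  primary_table.store(fresh, std::memory_order_seq_cst);
  // The VM additionally makes all threads agree on the new table (e.g. via a
  // handshake) before sweeping starts; that step is omitted in this sketch.
  return retired;
}

// Sweep the retired table: process dirty cards and clean them again.
static size_t sweep(CardValue* table) {
  size_t refined = 0;
  for (size_t i = 0; i < num_cards; i++) {
    if (table[i] == dirty_card) {
      // ... re-examine the heap range covered by this card here ...
      table[i] = clean_card;
      refined++;
    }
  }
  return refined;
}

int main() {
  for (size_t i = 0; i < num_cards; i++) { table_a[i] = clean_card; table_b[i] = clean_card; }
  post_write_barrier(10);
  post_write_barrier(42);
  CardValue* retired = switch_card_tables();
  return sweep(retired) == 2 ? 0 : 1;
}
```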
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: iwalulya review * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement * predicate for determining whether the refinement has been disabled * some other typos/comment improvements * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/fc674f02..b4d19d9b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=09-10 Stats: 40 lines in 8 files changed: 14 ins; 0 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From coleenp at openjdk.org Tue Mar 4 13:30:04 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 4 Mar 2025 13:30:04 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Mon, 3 Mar 2025 23:18:13 GMT, Patricio Chilano Mateo wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 1359: > >> 1357: // Build the doubly linked list to get hold of currentNode->prev(). >> 1358: _entry_list_tail = nullptr; >> 1359: entry_list_tail(current); > > I think we should try to avoid having to rebuild the doubly link list from scratch, since only a few nodes in the front might be missing the previous links. For platform threads it might not matter that much, but for virtual threads this list could be much larger. Maybe we can leave it as a future enhancement. We don't have a prev node, we don't know which node to set next to our next node to. 
The list will be broken. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1979432912 From adinn at openjdk.org Tue Mar 4 14:04:03 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 4 Mar 2025 14:04:03 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 11:11:44 GMT, Ferenc Rakoczi wrote: >> Oops. sorry - cut and paste error -- the new setting should be >> >> do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) > >> Oops. sorry - cut and paste error -- the new setting should be >> >> ``` >> do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) >> ``` > > @adinn, I have done this change, but that erased your approval. Could you reapprove? @ferakocz Feel free to integrate and I will sponsor ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697719261 From duke at openjdk.org Tue Mar 4 14:13:05 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Mar 2025 14:13:05 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 11:11:44 GMT, Ferenc Rakoczi wrote: >> Oops. sorry - cut and paste error -- the new setting should be >> >> do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) > >> Oops. sorry - cut and paste error -- the new setting should be >> >> ``` >> do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) >> ``` > > @adinn, I have done this change, but that erased your approval. Could you reapprove? > @ferakocz Feel free to integrate and I will sponsor @adinn thanks a lot for the review and the sponsoring, too! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697761033 From duke at openjdk.org Tue Mar 4 14:13:05 2025 From: duke at openjdk.org (duke) Date: Tue, 4 Mar 2025 14:13:05 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: <4goExO2NlWn1wVnu0eYddpXAN4h_t9F7VG4b-MHI_sE=.74de8ba0-eec5-401e-9aa5-6bda6a4e74a5@github.com> On Fri, 28 Feb 2025 06:22:09 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Merged master. > - Added more comments, mainly as suggested by Andrew Dinn > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions > - Adding comments + some code reorganization > - removed debugging code > - merging master > - ... and 3 more: https://git.openjdk.org/jdk/compare/ab4b0ef9...d82dfb2f @ferakocz Your change (at version d82dfb2f6d329f4caa0949bfbcd5dd5e5d52d6e9) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697751091 From mullan at openjdk.org Tue Mar 4 14:36:04 2025 From: mullan at openjdk.org (Sean Mullan) Date: Tue, 4 Mar 2025 14:36:04 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 06:22:09 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Merged master. > - Added more comments, mainly as suggested by Andrew Dinn > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions > - Adding comments + some code reorganization > - removed debugging code > - merging master > - ... and 3 more: https://git.openjdk.org/jdk/compare/ab4b0ef9...d82dfb2f I think it would be nice to add a release note for this describing the approximate performance improvement. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697841749 From duke at openjdk.org Tue Mar 4 14:44:00 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Mar 2025 14:44:00 GMT Subject: Integrated: 8348561: Add aarch64 intrinsics for ML-DSA In-Reply-To: References: Message-ID: On Fri, 24 Jan 2025 14:24:23 GMT, Ferenc Rakoczi wrote: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. This pull request has now been integrated. Changeset: 3230894b Author: Ferenc Rakoczi Committer: Andrew Dinn URL: https://git.openjdk.org/jdk/commit/3230894bdd8ab4183b83ad4c942eb6acad4acce6 Stats: 2611 lines in 22 files changed: 2030 ins; 92 del; 489 mod 8348561: Add aarch64 intrinsics for ML-DSA Reviewed-by: adinn ------------- PR: https://git.openjdk.org/jdk/pull/23300 From ayang at openjdk.org Tue Mar 4 15:47:00 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Tue, 4 Mar 2025 15:47:00 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 11:56:56 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > iwalulya review > * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement > * predicate for determining whether the refinement has been disabled > * some other typos/comment improvements > * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 356: > 354: bool do_heap_region(G1HeapRegion* r) override { > 355: if (!r->is_free()) { > 356: // Need to scan all parts of non-free regions, so reset the claim. Why is the condition "is_free"? I thought we scan only old-or-humongous regions? src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: > 114: SwapGlobalCT, // Swap global card table. > 115: SwapJavaThreadsCT, // Swap java thread's card tables. > 116: SwapGCThreadsCT, // Swap GC thread's card tables. Do GC threads have card-table? src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219: > 217: // The young gen revising mechanism reads the predictor and the values set > 218: // here. Avoid inconsistencies by locking. > 219: MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag); Who else can be in this critical-section? I don't get what this lock is protecting us from. src/hotspot/share/gc/g1/g1ConcurrentRefineThread.hpp line 83: > 81: > 82: public: > 83: static G1ConcurrentRefineThread* create(G1ConcurrentRefine* cr); I wonder if the comment for this class "One or more G1 Concurrent Refinement Threads..." has become obsolete. (AFAICS, this class is a singleton.) src/hotspot/share/gc/g1/g1ConcurrentRefineWorkTask.cpp line 69: > 67: } else if (res == G1RemSet::NoInteresting) { > 68: _refine_stats.inc_cards_clean_again(); > 69: } A `switch` is probably cleaner. src/hotspot/share/gc/g1/g1ConcurrentRefineWorkTask.cpp line 78: > 76: do_dirty_card(source, dest_card); > 77: } > 78: return pointer_delta(dirty_r, dirty_l, sizeof(CardValue)); I feel the `pointer_delta` line belongs to the caller. After that, even the entire method can be inlined to the caller. YMMV. 
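For the last two comments, something along the following lines is presumably what is meant. Apart from `NoInteresting`, which appears in the quoted snippet, every name below (enumerators, stats fields, helpers) is an invented placeholder rather than the patch's actual code.

```cpp
// Sketch of the two suggestions: a switch over the per-card refinement result,
// and letting the caller compute the number of refined cards itself.
#include <cstddef>
#include <cstdint>

using CardValue = uint8_t;

enum class RefineResult { HasRefToCSet, HasRefToOld, NoInteresting };  // placeholder names

struct RefineStats {                 // placeholder for the real statistics object
  size_t cards_to_cset = 0;
  size_t cards_to_old = 0;
  size_t cards_clean_again = 0;
};

// Stand-in for the real per-card refinement; always finds nothing interesting.
static RefineResult refine_card(CardValue*) { return RefineResult::NoInteresting; }

// if/else-if chain replaced by a switch: an unhandled result value becomes a
// compiler warning instead of being silently ignored.
static void do_dirty_card(RefineStats& stats, CardValue* card) {
  switch (refine_card(card)) {
    case RefineResult::HasRefToCSet:  stats.cards_to_cset++;      break;
    case RefineResult::HasRefToOld:   stats.cards_to_old++;       break;
    case RefineResult::NoInteresting: stats.cards_clean_again++;  break;
  }
}

// The helper only sweeps the dirty range...
static void sweep_dirty_range(RefineStats& stats, CardValue* dirty_l, CardValue* dirty_r) {
  for (CardValue* card = dirty_l; card < dirty_r; ++card) {
    do_dirty_card(stats, card);
  }
}

// ...and the caller, which already knows [dirty_l, dirty_r), counts the
// refined cards itself (the "pointer_delta belongs to the caller" point).
static size_t sweep_and_count(RefineStats& stats, CardValue* dirty_l, CardValue* dirty_r) {
  sweep_dirty_range(stats, dirty_l, dirty_r);
  return static_cast<size_t>(dirty_r - dirty_l);
}

int main() {
  CardValue cards[16] = {};
  RefineStats stats;
  size_t refined = sweep_and_count(stats, cards, cards + 16);
  return (refined == 16 && stats.cards_clean_again == 16) ? 0 : 1;
}
```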
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979666477 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979678325 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979699376 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979695999 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979705019 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979709682 From tschatzl at openjdk.org Tue Mar 4 16:03:55 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:03:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 15:16:17 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 356: > >> 354: bool do_heap_region(G1HeapRegion* r) override { >> 355: if (!r->is_free()) { >> 356: // Need to scan all parts of non-free regions, so reset the claim. > > Why is the condition "is_free"? I thought we scan only old-or-humongous regions? We also need to clear young gen region marks because we want them to be all clean in the card table for the garbage collection (evacuation failure handling, use in next cycle). This is maybe a bit of a waste if there are multiple refinement rounds between two gcs, but it's less expensive than in the pause wrt to latency. It's fast anyway. > src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: > >> 114: SwapGlobalCT, // Swap global card table. >> 115: SwapJavaThreadsCT, // Swap java thread's card tables. >> 116: SwapGCThreadsCT, // Swap GC thread's card tables. > > Do GC threads have card-table? Hmm, I thought I changed tat already just recently with Ivan's latest requests. Will fix. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979742662 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979752692 From tschatzl at openjdk.org Tue Mar 4 16:07:58 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:07:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 15:33:29 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming > > src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219: > >> 217: // The young gen revising mechanism reads the predictor and the values set >> 218: // here. Avoid inconsistencies by locking. 
>> 219: MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag); > > Who else can be in this critical-section? I don't get what this lock is protecting us from. The concurrent refine control thread in `G1ConcurrentRefineThread::do_refinement`, when calling `G1Policy::record_dirtying_stats`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979759329 From tschatzl at openjdk.org Tue Mar 4 16:07:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:07:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 16:00:46 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: >> >>> 114: SwapGlobalCT, // Swap global card table. >>> 115: SwapJavaThreadsCT, // Swap java thread's card tables. >>> 116: SwapGCThreadsCT, // Swap GC thread's card tables. >> >> Do GC threads have card-table? > > Hmm, I thought I changed tat already just recently with Ivan's latest requests. Will fix. Oh, I only fixed the string. Apologies. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979761737 From tschatzl at openjdk.org Tue Mar 4 16:20:58 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:20:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 15:56:05 GMT, Thomas Schatzl wrote: > It's fast anyway. To clarify: If you have multiple refinement rounds between two garbage collections, the time to clear the young gen cards is almost noise compared to the actual refinement effort. Like two magnitudes faster. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979785011 From tschatzl at openjdk.org Tue Mar 4 16:34:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:34:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: <3LR5VKMhSuXWmMlphpe8SLHm8vQQt6j343qaO61S_mQ=.dc1d2e4a-c858-44bd-9da0-f3f98340d939@github.com> On Tue, 4 Mar 2025 16:04:00 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219: >> >>> 217: // The young gen revising mechanism reads the predictor and the values set >>> 218: // here. Avoid inconsistencies by locking. >>> 219: MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag); >> >> Who else can be in this critical-section? I don't get what this lock is protecting us from. > > The concurrent refine control thread in `G1ConcurrentRefineThread::do_refinement`, when calling `G1Policy::record_dirtying_stats`. I could create an extra mutex for that if you want to make it clear which two parties access the same data. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979810144 From tschatzl at openjdk.org Tue Mar 4 17:20:28 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 17:20:28 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v12] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. 
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... 
Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: ayang review * renamings * refactorings ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/b4d19d9b..4a978118 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=10-11 Stats: 34 lines in 4 files changed: 13 ins; 1 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From pchilanomate at openjdk.org Tue Mar 4 17:38:58 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Tue, 4 Mar 2025 17:38:58 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> On Tue, 4 Mar 2025 04:50:34 GMT, David Holmes wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 204: >> >>> 202: // If the thread (F) that removes itself from the end of the list >>> 203: // hasn't got any prev pointer, we just set the tail pointer to >>> 204: // null, see 5) and 6) below. >> >> Setting the tail pointer to null would be for the case when this node is also the head, i.e single element. Otherwise we just rebuild the doubly link list, unlink F, and set entry_list_tail to G. In other words, the comment here and below seems to be missing that we have to build the doubly link list when F acquires the monitor, not when F needs to find a successor. > > We don't rebuild at this point. The thread that is removing itself just sets tail to null if there is no prev. Later when F exits the monitor it will construct the DLL to find the next successor. But if there is a previous node (just no previous pointer set) we have to rebuild the list, otherwise G would still be pointing to F. It would be this case: https://github.com/fbredber/jdk/blob/283c2431ec64b0865d4e678913c636732d01658f/src/hotspot/share/runtime/objectMonitor.cpp#L1313 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1979921706 From fbredberg at openjdk.org Tue Mar 4 18:12:59 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Tue, 4 Mar 2025 18:12:59 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> Message-ID: On Tue, 4 Mar 2025 17:36:43 GMT, Patricio Chilano Mateo wrote: >> We don't rebuild at this point. The thread that is removing itself just sets tail to null if there is no prev. Later when F exits the monitor it will construct the DLL to find the next successor. > > But if there is a previous node (just no previous pointer set) we have to rebuild the list, otherwise G would still be pointing to F. It would be this case: https://github.com/fbredber/jdk/blob/283c2431ec64b0865d4e678913c636732d01658f/src/hotspot/share/runtime/objectMonitor.cpp#L1313 You're quite right. I'll rewrite that section of the comment. Thank you for spotting this. 
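A small stand-alone model of the entry list being discussed may help follow the F/G example; it is purely illustrative and not the ObjectMonitor code, and it leaves out node removal, which is what the exchange above is about. Nodes are pushed at the head with only next links, prev links are filled in lazily by walking from the head, and, as suggested later in the thread, the walk can stop at the first node whose prev link is already set.

```cpp
// Toy model of an "entry list" where nodes are pushed at the head with only
// next links, and prev links plus the tail are established lazily.
// Illustrative only; this is not the HotSpot ObjectMonitor implementation.
#include <cassert>

struct Node {
  Node* next = nullptr;
  Node* prev = nullptr;   // filled in lazily, points towards the head
  int id;
  explicit Node(int i) : id(i) {}
};

struct EntryList {
  Node* head = nullptr;
  Node* tail = nullptr;   // only valid once prev links have been built up to it

  // New waiters are pushed at the head (in the VM this is a CAS, not a plain store).
  void push(Node* n) {
    n->next = head;
    head = n;
  }

  // Build prev links walking from the head towards the tail. The walk can stop
  // at the first node whose prev link is already non-null: nodes are only ever
  // added at the head, so a prev link built earlier is still correct.
  Node* build_prev_links_and_get_tail() {
    Node* predecessor = nullptr;
    for (Node* cur = head; cur != nullptr; cur = cur->next) {
      if (cur->prev != nullptr) {
        break;                    // rest of the list (and the cached tail) is intact
      }
      cur->prev = predecessor;
      predecessor = cur;
      if (cur->next == nullptr) {
        tail = cur;               // reached the end, remember the tail
      }
    }
    return tail;
  }
};

int main() {
  EntryList l;
  Node a(1), b(2), c(3);
  l.push(&a);                           // list: a
  l.build_prev_links_and_get_tail();    // a.prev == nullptr, tail == &a
  l.push(&b);                           // list: b -> a
  l.push(&c);                           // list: c -> b -> a, c and b lack prev links
  Node* t = l.build_prev_links_and_get_tail();
  assert(t == &a && a.prev == &b && b.prev == &c && c.prev == nullptr);
  return 0;
}
```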
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1979966484 From pchilanomate at openjdk.org Tue Mar 4 18:13:00 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Tue, 4 Mar 2025 18:13:00 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Tue, 4 Mar 2025 13:27:17 GMT, Coleen Phillimore wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 1359: >> >>> 1357: // Build the doubly linked list to get hold of currentNode->prev(). >>> 1358: _entry_list_tail = nullptr; >>> 1359: entry_list_tail(current); >> >> I think we should try to avoid having to rebuild the doubly link list from scratch, since only a few nodes in the front might be missing the previous links. For platform threads it might not matter that much, but for virtual threads this list could be much larger. Maybe we can leave it as a future enhancement. > > We don't have a prev node, we don't know which node to set next to our next node to. The list will be broken. Right, we still have to set the previous links for those nodes. I'm just suggesting we don't have to walk the whole list, just until the last node we set the previous pointer. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1979963352 From dlong at openjdk.org Tue Mar 4 18:14:00 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 4 Mar 2025 18:14:00 GMT Subject: RFR: 8351036: [JVMCI] value not an s2: -32776 In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 09:18:40 GMT, Doug Simon wrote: > This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23888#pullrequestreview-2658525441 From mpowers at openjdk.org Tue Mar 4 19:28:02 2025 From: mpowers at openjdk.org (Mark Powers) Date: Tue, 4 Mar 2025 19:28:02 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v2] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 19:00:59 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added comments, removed debugging printfs

ML-DSA benchmark results for this PR

keygen ML-DSA-44 96 us/op
keygen ML-DSA-65 200 us/op
keygen ML-DSA-87 272 us/op
siggen ML-DSA-44 297 us/op
siggen ML-DSA-65 452 us/op
siggen ML-DSA-87 728 us/op
sigver ML-DSA-44 115 us/op
sigver ML-DSA-65 176 us/op
sigver ML-DSA-87 290 us/op

ML-DSA no intrinsics

keygen ML-DSA-44 169 us/op
keygen ML-DSA-65 302 us/op
keygen ML-DSA-87 444 us/op
siggen ML-DSA-44 696 us/op
siggen ML-DSA-65 1114 us/op
siggen ML-DSA-87 1828 us/op
sigver ML-DSA-44 187 us/op
sigver ML-DSA-65 295 us/op
sigver ML-DSA-87 473 us/op

------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2698691038 From dnsimon at openjdk.org Tue Mar 4 20:14:03 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Mar 2025 20:14:03 GMT Subject: RFR: 8351036: [JVMCI] value not an s2: -32776 In-Reply-To: References: Message-ID: <-tI6hRLLVFZKckI0dXweArTpvkkuppQ-UCe7QCP204M=.7071b95a-b2bd-43fd-8593-47a3e0711a98@github.com> On Tue, 4 Mar 2025 09:18:40 GMT, Doug Simon wrote: > This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. Thanks for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23888#issuecomment-2698788687 From dnsimon at openjdk.org Tue Mar 4 20:14:04 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Mar 2025 20:14:04 GMT Subject: Integrated: 8351036: [JVMCI] value not an s2: -32776 In-Reply-To: References: Message-ID: <13cAPTn_ilQ-6cQXLy7mta5wV4zczVRsYdpRe5RqnWw=.be50128c-0f43-4bca-8cd8-5a01b51b1c34@github.com> On Tue, 4 Mar 2025 09:18:40 GMT, Doug Simon wrote: > This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. This pull request has now been integrated. Changeset: a21302bb Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/a21302bb3244b85dd9809c42d1c0fd502bd677cc Stats: 44 lines in 4 files changed: 36 ins; 0 del; 8 mod 8351036: [JVMCI] value not an s2: -32776 Reviewed-by: yzheng, dlong ------------- PR: https://git.openjdk.org/jdk/pull/23888 From duke at openjdk.org Tue Mar 4 22:04:26 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Mar 2025 22:04:26 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v4] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Fixed mismerge. - Merged master.
- A little cleanup - Merged master - removing trailing spaces - kyber aarch64 intrinsics ------------- Changes: https://git.openjdk.org/jdk/pull/23663/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=03 Stats: 2508 lines in 18 files changed: 2464 ins; 16 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From dholmes at openjdk.org Wed Mar 5 05:17:55 2025 From: dholmes at openjdk.org (David Holmes) Date: Wed, 5 Mar 2025 05:17:55 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> Message-ID: On Tue, 4 Mar 2025 18:09:56 GMT, Fredrik Bredberg wrote: >> But if there is a previous node (just no previous pointer set) we have to rebuild the list, otherwise G would still be pointing to F. It would be this case: https://github.com/fbredber/jdk/blob/283c2431ec64b0865d4e678913c636732d01658f/src/hotspot/share/runtime/objectMonitor.cpp#L1313 > > You're quite right. I'll rewrite that section of the comment. Thank you for spotting this. Yep my bad - you can't delete yourself without a prev node pointer when you are being pointed to. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1980709561 From tschatzl at openjdk.org Wed Mar 5 09:45:00 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 5 Mar 2025 09:45:00 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v13] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * fix whitespace * additional whitespace between log tags * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/4a978118..a457e6e7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=11-12 Stats: 116 lines in 6 files changed: 50 ins; 50 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From iwalulya at openjdk.org Wed Mar 5 11:12:56 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Wed, 5 Mar 2025 11:12:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v13] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 09:45:00 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix whitespace > * additional whitespace between log tags > * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 32: > 30: #include "gc/g1/g1HeapRegion.hpp" > 31: #include "gc/g1/g1ThreadLocalData.hpp" > 32: #include "utilities/macros.hpp" Suggestion: #include "utilities/formatBuffer.hpp" #include "utilities/macros.hpp" to use `err_msg` src/hotspot/share/gc/g1/g1RemSet.cpp line 90: > 88: // contiguous ranges of dirty cards to be scanned. These blocks are converted to actual > 89: // memory ranges and then passed on to actual scanning. > 90: class G1RemSetScanState : public CHeapObj { Need to update the comment above to remove reference to "log buffers" (L:67). 
src/hotspot/share/gc/g1/g1RemSet.hpp line 44: > 42: class CardTableBarrierSet; > 43: class G1AbstractSubTask; > 44: class G1RemSetScanState; Already declared on line 48 below src/hotspot/share/gc/g1/g1ThreadLocalData.hpp line 29: > 27: #include "gc/g1/g1BarrierSet.hpp" > 28: #include "gc/g1/g1CardTable.hpp" > 29: #include "gc/g1/g1CollectedHeap.hpp" probably does not need to be included ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1981138746 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1981162792 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1981118865 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1981142943 From iwalulya at openjdk.org Wed Mar 5 11:12:58 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Wed, 5 Mar 2025 11:12:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v12] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 17:20:28 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). 
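For comparison, the Parallel/Serial-style post-write barrier mentioned in the quoted description amounts to a single unconditional card mark. The following is only a conceptual Java sketch of that scheme, not JDK code: the real barrier is emitted by the JIT as a few machine instructions, and the card size, table layout and names below are illustrative assumptions.

    class CardMarkSketch {
        // Assumed card geometry: one card-table byte per 512-byte card.
        static final int CARD_SHIFT = 9;
        static final byte DIRTY = 0;
        static final byte[] CARD_TABLE = new byte[1 << 20]; // toy table size
        static final long HEAP_BASE = 0L;                    // assumed heap start

        // Parallel-style post-write barrier: unconditionally dirty the card
        // covering the updated field. No filtering, no StoreLoad, no queueing.
        static void postWriteBarrier(long fieldAddress) {
            CARD_TABLE[(int) ((fieldAddress - HEAP_BASE) >>> CARD_SHIFT)] = DIRTY;
        }
    }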
>> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > ayang review > * renamings > * refactorings src/hotspot/share/gc/g1/g1HeapRegion.hpp line 475: > 473: void hr_clear(bool clear_space); > 474: // Clear the card table corresponding to this region. > 475: void clear_cardtable(); in some places `cardtable()` has been refactored to `card_table` e.g. in G1HeapRegionManager. src/hotspot/share/gc/g1/g1ParScanThreadState.hpp line 67: > 65: > 66: size_t _num_marked_as_dirty_cards; > 67: size_t _num_marked_as_into_cset_cards; Suggestion: size_t _num_cards_marked_dirty; size_t _num_cards_marked_to_cset; ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1980117641 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1980145229 From duke at openjdk.org Wed Mar 5 11:33:06 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 5 Mar 2025 11:33:06 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: - Merged master. - Added comments, removed debugging printfs - JDK-8351034 Add AVX-512 intrinsics for ML-DSA ------------- Changes: https://git.openjdk.org/jdk/pull/23860/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=02 Stats: 1642 lines in 8 files changed: 1636 ins; 2 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From jbhateja at openjdk.org Wed Mar 5 11:38:52 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 5 Mar 2025 11:38:52 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v2] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 19:00:59 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added comments, removed debugging printfs src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 420: > 418: __ movptr(constant2use, round_consts); > 419: > 420: __ BIND(rounds24_loop); For Icache alignment, please use __ align64() before the loop entry. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1978822704 From jbhateja at openjdk.org Wed Mar 5 11:42:01 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 5 Mar 2025 11:42:01 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: Message-ID: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> On Wed, 5 Mar 2025 11:33:06 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Merged master. > - Added comments, removed debugging printfs > - JDK-8351034 Add AVX-512 intrinsics for ML-DSA src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 292: > 290: __ movl(iterations, 2); > 291: > 292: __ BIND(L_loop); Please align loop entry address using __align64(). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1981242267 From fbredberg at openjdk.org Wed Mar 5 12:31:23 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:31:23 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. > > However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: Updated comments after review by Patricio.
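To make the push/walk scheme described above concrete, here is a small self-contained Java sketch of the two operations: contending threads CAS themselves onto the head, and the exiting owner walks from the head filling in the prev pointers to locate the FIFO tail. It uses java.util.concurrent.atomic.AtomicReference and a toy Waiter class purely for illustration; the real code is the C++ ObjectMonitor implementation in the patch, and the names below are not taken from it.

    import java.util.concurrent.atomic.AtomicReference;

    // Toy stand-in for the C++ ObjectWaiter node; illustrative only.
    final class Waiter {
        volatile Waiter next;
        volatile Waiter prev;
    }

    final class EntryListSketch {
        private final AtomicReference<Waiter> entryList = new AtomicReference<>();
        private Waiter entryListTail; // cached tail, touched only by the lock owner

        // Contending thread: push itself onto the head with a CAS retry loop.
        void pushToEntryList(Waiter self) {
            Waiter head;
            do {
                head = entryList.get();
                self.next = head;
                self.prev = null;
            } while (!entryList.compareAndSet(head, self));
        }

        // Exiting owner: walk from the head, filling in prev pointers, so the
        // FIFO successor (the tail) can be found. Only the owner does this.
        Waiter findTail() {
            if (entryListTail != null) {
                return entryListTail;
            }
            Waiter prev = null;
            for (Waiter w = entryList.get(); w != null; w = w.next) {
                w.prev = prev;
                prev = w;
            }
            entryListTail = prev;
            return prev;
        }
    }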
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23421/files - new: https://git.openjdk.org/jdk/pull/23421/files/283c2431..0d2d6c34 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=01-02 Stats: 11 lines in 1 file changed: 1 ins; 1 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23421.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23421/head:pull/23421 PR: https://git.openjdk.org/jdk/pull/23421 From fbredberg at openjdk.org Wed Mar 5 12:31:24 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:31:24 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> Message-ID: On Wed, 5 Mar 2025 05:14:54 GMT, David Holmes wrote: >> You're quite right. I'll rewrite that section of the comment. Thank you for spotting this. > > Yep my bad - you can't delete yourself without a prev node pointer when you are being pointed to. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981307190 From fbredberg at openjdk.org Wed Mar 5 12:34:56 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:34:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Mon, 3 Mar 2025 23:10:29 GMT, Patricio Chilano Mateo wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 1265: > >> 1263: // that updated _entry_list, so we can access w->_next. >> 1264: w = Atomic::load_acquire(&_entry_list); >> 1265: assert(w != nullptr, "invariant"); > > Maybe add the same assert as below for the single element case: `assert(w->TState == ObjectWaiter::TS_ENTER, "invariant")`. Since this is not strictly necessary, I will look into this in a follow up PR. > src/hotspot/share/runtime/objectMonitor.cpp line 1532: > >> 1530: // Let's say T1 then stalls. T2 acquires O and calls O.notify(). The >> 1531: // notify() operation moves T1 from O's waitset to O's entry_list. T2 then >> 1532: // release the lock "O". T2 resumes immediately after the ST of null into > > Pre-existent, but this should be T1. Same in next sentence. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981314184 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981313487 From fbredberg at openjdk.org Wed Mar 5 12:34:57 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:34:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Tue, 4 Mar 2025 18:08:15 GMT, Patricio Chilano Mateo wrote: >> We don't have a prev node, we don't know which node to set next to our next node to. The list will be broken. > > Right, we still have to set the previous links for those nodes. 
I'm just suggesting we don't have to walk the whole list, just until the last node we set the previous pointer. Since this is not strictly necessary, I will look into this in a follow up PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981312971 From yzheng at openjdk.org Wed Mar 5 12:37:52 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Wed, 5 Mar 2025 12:37:52 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. Overall looks good to me src/hotspot/share/ci/ciInstanceKlass.cpp line 481: > 479: // Now sort them by offset, ascending. > 480: // (In principle, they could mix with superclass fields.) > 481: fields->sort(sort_field_by_offset); This has no effect now, i.e., the fields were sorted already? ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/23849#pullrequestreview-2660958414 PR Review Comment: https://git.openjdk.org/jdk/pull/23849#discussion_r1981305860 From fbredberg at openjdk.org Wed Mar 5 12:43:03 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:43:03 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: <2pmoWBdeasqGUxjDKvJMIBUqgipo33xTNrYIdB6U1vM=.79067439-b3c1-4b47-8669-7e4d77a22b3f@github.com> On Mon, 3 Mar 2025 23:12:05 GMT, Patricio Chilano Mateo wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 1509: > >> 1507: // is no successor, so it appears that an heir-presumptive >> 1508: // (successor) must be made ready. Only the current lock owner can >> 1509: // detach threads from the entry_list, therefore we need to > > We don't detach threads here, so maybe manipulate would be better. Maybe, but manipulate may also include "pushing to the head", which is fine to do without holding the lock. I'll keep the comment as is for now, maybe this sentence will be deleted if we find a way of running exit without holding the lock, as we have talked about. If that's not possible I'll rephrase this sentence in a follow up PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981325861 From fbredberg at openjdk.org Wed Mar 5 12:51:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:51:02 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: <6TKNnpUGSCflszCRIY531Nnf1kMxjlYQm3V4Yf44riY=.5c5f69ef-cf63-4748-902b-39c2898762ee@github.com> On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. 
>> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. @mur47x111 I'm getting ready to integrate. I've seen that you have created [[JDK-8349711] Adapt JDK-8343840: Rewrite the ObjectMonitor lists](https://github.com/oracle/graal/pull/10757) to handle the change on your side. Do you see any reason why I shouldn't integrate, or are you fine with me integrating this PR now? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2700837592 From duke at openjdk.org Wed Mar 5 13:10:34 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 5 Mar 2025 13:10:34 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Added alignment to loop entries. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/331f1ecb..3aaa106f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=02-03 Stats: 9 lines in 2 files changed: 9 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Wed Mar 5 13:10:35 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 5 Mar 2025 13:10:35 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: On Wed, 5 Mar 2025 11:39:05 GMT, Jatin Bhateja wrote: >> Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: >> >> - Merged master. >> - Added comments, removed debugging printfs >> - JDK-8351034 Add AVX-512 intrinsics for ML-DSA > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 292: > >> 290: __ movl(iterations, 2); >> 291: >> 292: __ BIND(L_loop); > > Hi @ferakocz , Kindly align loop entry address using __align64() here and at all the places before __BIND(LOOP) Hi, @jatin-bhateja, thanks for the suggestion. I have added __ align(OptoLoopAlignment); before all loop entries. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1981364481 From dnsimon at openjdk.org Wed Mar 5 13:50:53 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 5 Mar 2025 13:50:53 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:26:12 GMT, Yudi Zheng wrote: >> The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. >> >> It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. > > src/hotspot/share/ci/ciInstanceKlass.cpp line 481: > >> 479: // Now sort them by offset, ascending. >> 480: // (In principle, they could mix with superclass fields.) >> 481: fields->sort(sort_field_by_offset); > > This has no effect now, i.e., the fields were sorted already? They now have whatever sort order is given by JavaFieldStream. This happens to currently be class file declaration order but it doesn't really matter if it changes. The only requirement is that the same order is used by `get_reassigned_fields` in `deoptimization.cpp`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23849#discussion_r1981441818 From jbhateja at openjdk.org Wed Mar 5 14:05:53 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 5 Mar 2025 14:05:53 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: On Wed, 5 Mar 2025 13:07:54 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 292: >> >>> 290: __ movl(iterations, 2); >>> 291: >>> 292: __ BIND(L_loop); >> >> Hi @ferakocz , Kindly align loop entry address using __align64() here and at all the places before __BIND(LOOP) > > Hi, @jatin-bhateja, thanks for the suggestion. I have added __ align(OptoLoopAlignment); before all loop entries. Hi @ferakocz , Thanks!, for efficient utilization of Decode ICache (please refer to Intel SDM section 3.4.2.5), code blocks should be aligned to 32-byte boundaries; a 64-byte aligned code is a superset of both 16 and 32 byte aligned addresses and also matches with the cacheline size. However, I can noticed that we have been using OptoLoopAlignment at places in AES-GCM also. I introduced some errors in generate_dilithiumAlmostInverseNtt_avx512 implementation in anticipation of catching it through existing ML_DSA_Tests under test/jdk/sun/security/provider/acvp But all the tests passed for me. `java -jar /home/jatinbha/sandboxes/jtreg/build/images/jtreg/lib/jtreg.jar -jdk:$JAVA_HOME -Djdk.test.lib.artifacts.ACVP-Server=/home/jatinbha/softwares/v1.1.0.38.zip -va -timeout:4 Launcher.java` Can you please point out a test I need to use for validation ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1981468903 From coleenp at openjdk.org Wed Mar 5 14:35:58 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 5 Mar 2025 14:35:58 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. 
>> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. Marked as reviewed by coleenp (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2661313357 From yzheng at openjdk.org Wed Mar 5 14:49:57 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Wed, 5 Mar 2025 14:49:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. 
>> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. JVMCI changes look go to me! We are good to go! ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2661358578 From pchilanomate at openjdk.org Wed Mar 5 14:52:57 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Wed, 5 Mar 2025 14:52:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. 
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. Thanks, looks good. ------------- Marked as reviewed by pchilanomate (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2661362873 From pchilanomate at openjdk.org Wed Mar 5 14:52:59 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Wed, 5 Mar 2025 14:52:59 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <2pmoWBdeasqGUxjDKvJMIBUqgipo33xTNrYIdB6U1vM=.79067439-b3c1-4b47-8669-7e4d77a22b3f@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <2pmoWBdeasqGUxjDKvJMIBUqgipo33xTNrYIdB6U1vM=.79067439-b3c1-4b47-8669-7e4d77a22b3f@github.com> Message-ID: On Wed, 5 Mar 2025 12:40:41 GMT, Fredrik Bredberg wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 1509: >> >>> 1507: // is no successor, so it appears that an heir-presumptive >>> 1508: // (successor) must be made ready. Only the current lock owner can >>> 1509: // detach threads from the entry_list, therefore we need to >> >> We don't detach threads here, so maybe manipulate would be better. > > Maybe, but manipulate may also include "pushing to the head", which is fine to do without holding the lock. > I'll keep the comment as is for now, maybe this sentence will be deleted if we find a way of running exit without holding the lock, as we have talked about. If that's not possible I'll rephrase this sentence in a follow up PR. You could use the same wording we have in the comment above already just to make it consistent: `manipulate the _entry_list (except for pushing new threads to the head)`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981553339 From fbredberg at openjdk.org Wed Mar 5 14:56:57 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 14:56:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <2pmoWBdeasqGUxjDKvJMIBUqgipo33xTNrYIdB6U1vM=.79067439-b3c1-4b47-8669-7e4d77a22b3f@github.com> Message-ID: On Wed, 5 Mar 2025 14:49:01 GMT, Patricio Chilano Mateo wrote: >> Maybe, but manipulate may also include "pushing to the head", which is fine to do without holding the lock. >> I'll keep the comment as is for now, maybe this sentence will be deleted if we find a way of running exit without holding the lock, as we have talked about. If that's not possible I'll rephrase this sentence in a follow up PR. > > You could use the same wording we have in the comment above already just to make it consistent: `manipulate the _entry_list (except for pushing new threads to the head)`. I promise I'll fix that in the follow up PR. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981563527 From duke at openjdk.org Wed Mar 5 18:30:03 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 5 Mar 2025 18:30:03 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: On Wed, 5 Mar 2025 14:03:00 GMT, Jatin Bhateja wrote: >> Hi, @jatin-bhateja, thanks for the suggestion. I have added __ align(OptoLoopAlignment); before all loop entries. > > Hi @ferakocz , > > Thanks!, for efficient utilization of Decode ICache (please refer to Intel SDM section 3.4.2.5), code blocks should be aligned to 32-byte boundaries; a 64-byte aligned code is a superset of both 16 and 32 byte aligned addresses and also matches with the cacheline size. However, I can noticed that we have been using OptoLoopAlignment at places in AES-GCM also. > > I introduced some errors in generate_dilithiumAlmostInverseNtt_avx512 implementation in anticipation of catching it through existing ML_DSA_Tests under > test/jdk/sun/security/provider/acvp > > But all the tests passed for me. > `java -jar /home/jatinbha/sandboxes/jtreg/build/images/jtreg/lib/jtreg.jar -jdk:$JAVA_HOME -Djdk.test.lib.artifacts.ACVP-Server=/home/jatinbha/softwares/v1.1.0.38.zip -va -timeout:4 Launcher.java` > > Can you please point out a test I need to use for validation I think the easiest is to put a for (int i = 0; i < 1000; i++) loop around the switch statement in the run() method of the ML_DSA_Test class (test/jdk/sun/security/provider/acvp/ML_DSA_Test.java). (This is because the intrinsics kick in after a few thousand calls of the method.) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1981945490 From dholmes at openjdk.org Thu Mar 6 04:26:57 2025 From: dholmes at openjdk.org (David Holmes) Date: Thu, 6 Mar 2025 04:26:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. 
>> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. LGTM! Thanks ------------- Marked as reviewed by dholmes (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2663279336 From fbredberg at openjdk.org Thu Mar 6 09:11:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 6 Mar 2025 09:11:02 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. 
>> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. Thanks everyone for the reviews, testing and Graal adaptation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2703235851 From fbredberg at openjdk.org Thu Mar 6 09:11:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 6 Mar 2025 09:11:02 GMT Subject: Integrated: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: <8EoKGr_0E4MpBGmwoWS8At5wyy2Q44zgxI8eWi4A-AA=.46f9908d-a525-4ca5-9472-34cfd37de0d3@github.com> On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. 
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. > > However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... This pull request has now been integrated. Changeset: 7a5acb9b Author: Fredrik Bredberg URL: https://git.openjdk.org/jdk/commit/7a5acb9be17cd54bbd0abf2524386b981dd5ac04 Stats: 614 lines in 10 files changed: 214 ins; 228 del; 172 mod 8343840: Rewrite the ObjectMonitor lists Reviewed-by: dholmes, coleenp, pchilanomate, yzheng ------------- PR: https://git.openjdk.org/jdk/pull/23421 From jbhateja at openjdk.org Thu Mar 6 09:34:53 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 6 Mar 2025 09:34:53 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: On Wed, 5 Mar 2025 18:27:44 GMT, Ferenc Rakoczi wrote: >> Hi @ferakocz , >> >> Thanks!, for efficient utilization of Decode ICache (please refer to Intel SDM section 3.4.2.5), code blocks should be aligned to 32-byte boundaries; a 64-byte aligned code is a superset of both 16 and 32 byte aligned addresses and also matches with the cacheline size. However, I can noticed that we have been using OptoLoopAlignment at places in AES-GCM also. >> >> I introduced some errors in generate_dilithiumAlmostInverseNtt_avx512 implementation in anticipation of catching it through existing ML_DSA_Tests under >> test/jdk/sun/security/provider/acvp >> >> But all the tests passed for me. >> `java -jar /home/jatinbha/sandboxes/jtreg/build/images/jtreg/lib/jtreg.jar -jdk:$JAVA_HOME -Djdk.test.lib.artifacts.ACVP-Server=/home/jatinbha/softwares/v1.1.0.38.zip -va -timeout:4 Launcher.java` >> >> Can you please point out a test I need to use for validation > > I think the easiest is to put a for (int i = 0; i < 1000; i++) loop around the switch statement in the run() method of the ML_DSA_Test class (test/jdk/sun/security/provider/acvp/ML_DSA_Test.java). (This is because the intrinsics kick in after a few thousand calls of the method.) Hi @ferakocz , Yes, we should modify the test or lower the compilation threshold with -Xbatch -XX:TieredCompileThreshold=0.1. Alternatively, since the tests has a depedency on Automatic Cryptographic Validation Test server I have created a simplified test which cover all the security levels. Kindly include [test/hotspot/jtreg/compiler/intrinsics/signature/TestModuleLatticeDSA.java ](https://github.com/ferakocz/jdk/pull/1) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983009390 From jbhateja at openjdk.org Thu Mar 6 09:34:55 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 6 Mar 2025 09:34:55 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 13:10:34 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
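For context, a minimal Java sketch (not part of this patch) that exercises the three accelerated operations through the standard JCA API. It assumes a JDK that ships an ML-DSA provider (JDK 24 or later); the class name and iteration count are made up, and the loop only reflects the point made above that the intrinsics kick in once the methods have been called a few thousand times.

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class MlDsaSmokeTest {
    public static void main(String[] args) throws Exception {
        byte[] message = "hello".getBytes();

        KeyPairGenerator kpg = KeyPairGenerator.getInstance("ML-DSA");
        KeyPair kp = kpg.generateKeyPair();              // key generation

        for (int i = 0; i < 10_000; i++) {               // enough calls for C2 compilation + intrinsics
            Signature signer = Signature.getInstance("ML-DSA");
            signer.initSign(kp.getPrivate());
            signer.update(message);
            byte[] sig = signer.sign();                  // document signing

            Signature verifier = Signature.getInstance("ML-DSA");
            verifier.initVerify(kp.getPublic());
            verifier.update(message);
            if (!verifier.verify(sig)) {                 // signature verification
                throw new AssertionError("verification failed");
            }
        }
    }
}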
> > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added alignment to loop entries. src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 85: > 83: if (UseSHA3Intrinsics) { > 84: StubRoutines::_sha3_implCompress = generate_sha3_implCompress(StubGenStubId::sha3_implCompress_id); > 85: StubRoutines::_double_keccak = generate_double_keccak(); Should UseDilithiumIntrinsics guard double_keccak generation ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1982922845 From duke at openjdk.org Thu Mar 6 09:49:12 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 6 Mar 2025 09:49:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: On Thu, 6 Mar 2025 08:37:57 GMT, Jatin Bhateja wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Added alignment to loop entries. > > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 85: > >> 83: if (UseSHA3Intrinsics) { >> 84: StubRoutines::_sha3_implCompress = generate_sha3_implCompress(StubGenStubId::sha3_implCompress_id); >> 85: StubRoutines::_double_keccak = generate_double_keccak(); > > Should UseDilithiumIntrinsics guard double_keccak generation ? No, that is more of a SHA3 thing, other algorithms can take advantage of it, too (e.g. ML-KEM). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983033331 From galder at openjdk.org Thu Mar 6 14:06:35 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 6 Mar 2025 14:06:35 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v13] In-Reply-To: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: > This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. > > Currently vectorization does not kick in for loops containing either of these calls because of the following error: > > > VLoop::check_preconditions: failed: control flow in loop not allowed > > > The control flow is due to the java implementation for these methods, e.g. > > > public static long max(long a, long b) { > return (a >= b) ? a : b; > } > > > This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. > By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. > E.g. > > > SuperWord::transform_loop: > Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined > 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) > > > Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. 
Before the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1155 > long max 1173 > > > After the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1042 > long max 1042 > > > This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. > Therefore, it still relies on the macro expansion to transform those into CMoveL. > > I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier1 2500 2500 0 0 >>> jtreg:test/jdk:tier1 ... Galder Zamarre?o has updated the pull request incrementally with one additional commit since the last revision: Add simple reduction benchmarks on top of multiply ones ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20098/files - new: https://git.openjdk.org/jdk/pull/20098/files/a190ae68..d0e793a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=11-12 Stats: 44 lines in 1 file changed: 40 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/20098.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20098/head:pull/20098 PR: https://git.openjdk.org/jdk/pull/20098 From epeter at openjdk.org Thu Mar 6 15:07:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 6 Mar 2025 15:07:07 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Thu, 27 Feb 2025 16:38:30 GMT, Galder Zamarre?o wrote: >> Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: >> >> - Merge branch 'master' into topic.intrinsify-max-min-long >> - Fix typo >> - Renaming methods and variables and add docu on algorithms >> - Fix copyright years >> - Make sure it runs with cpus with either avx512 or asimd >> - Test can only run with 256 bit registers or bigger >> >> * Remove platform dependant check >> and use platform independent configuration instead. >> - Fix license header >> - Tests should also run on aarch64 asimd=true envs >> - Added comment around the assertions >> - Adjust min/max identity IR test expectations after changes >> - ... and 34 more: https://git.openjdk.org/jdk/compare/47fdb836...a190ae68 > > Also, I've started a [discussion on jmh-dev](https://mail.openjdk.org/pipermail/jmh-dev/2025-February/004094.html) to see if there's a way to minimise pollution of `Math.min(II)` compilation. As a follow to https://github.com/openjdk/jdk/pull/20098#issuecomment-2684701935 I looked at where the other `Math.min(II)` calls are coming from, and a big chunk seem related to the JMH infrastructure. 
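As an aside, a minimal JMH-style sketch (hypothetical, not one of the benchmarks used in this PR) of the reduction shape the intrinsic targets: once Math.max(long, long) is intrinsified to a branchless MaxL node, the loop body below contains no control flow and becomes a SuperWord candidate.

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class LongMaxReductionSketch {
    long[] values = new long[1024];

    @Setup
    public void fill() {
        for (int i = 0; i < values.length; i++) {
            values[i] = (i * 31L) % 7919;    // arbitrary data
        }
    }

    @Benchmark
    public long maxReduction() {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < values.length; i++) {
            max = Math.max(max, values[i]);  // intrinsified: no CmpL/Bool in the loop body
        }
        return max;
    }
}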
@galderz about: > Additional performance improvement: make SuperWord recognize more cases as profitble (see Regression 1). Optional. This should already be covered by these, and I will handle that eventually with the Cost-Model RFE [JDK-8340093](https://bugs.openjdk.org/browse/JDK-8340093): - [JDK-8345044](https://bugs.openjdk.org/browse/JDK-8345044) Sum of array elements not vectorized - (min/max of array) - [JDK-8336000](https://bugs.openjdk.org/browse/JDK-8336000) C2 SuperWord: report that 2-element reductions do not vectorize - You would for example see that on aarch64 machines with only neon/asimd support you can have at most 2 longs per vector, because the max vector length is 128 bits. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2704110051 From epeter at openjdk.org Thu Mar 6 15:26:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 6 Mar 2025 15:26:09 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Thu, 27 Feb 2025 16:38:30 GMT, Galder Zamarre?o wrote: >> Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: >> >> - Merge branch 'master' into topic.intrinsify-max-min-long >> - Fix typo >> - Renaming methods and variables and add docu on algorithms >> - Fix copyright years >> - Make sure it runs with cpus with either avx512 or asimd >> - Test can only run with 256 bit registers or bigger >> >> * Remove platform dependant check >> and use platform independent configuration instead. >> - Fix license header >> - Tests should also run on aarch64 asimd=true envs >> - Added comment around the assertions >> - Adjust min/max identity IR test expectations after changes >> - ... and 34 more: https://git.openjdk.org/jdk/compare/dfbb2ee6...a190ae68 > > Also, I've started a [discussion on jmh-dev](https://mail.openjdk.org/pipermail/jmh-dev/2025-February/004094.html) to see if there's a way to minimise pollution of `Math.min(II)` compilation. As a follow to https://github.com/openjdk/jdk/pull/20098#issuecomment-2684701935 I looked at where the other `Math.min(II)` calls are coming from, and a big chunk seem related to the JMH infrastructure. @galderz about: > Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional. I looked at `src/hotspot/cpu/x86/x86.ad` bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType bt) { 1774 case Op_MaxV: 1775 case Op_MinV: 1776 if (UseSSE < 4 && is_integral_type(bt)) { 1777 return false; 1778 } ... So it seems that here lanewise min/max are supported for AVX2. But it seems that's different for reductions: 1818 case Op_MinReductionV: 1819 case Op_MaxReductionV: 1820 if ((bt == T_INT || is_subword_type(bt)) && UseSSE < 4) { 1821 return false; 1822 } else if (bt == T_LONG && (UseAVX < 3 || !VM_Version::supports_avx512vlbwdq())) { 1823 return false; 1824 } ... So it seems maybe we could improve the AVX2 coverage for reductions. But honestly, I will probably find this issue again once I work on the other reductions above, and run the benchmarks. I think that will make it easier to investigate all of this. 
I will for example adjust the IR rules, and then it will be apparent where there are cases that are not covered. @galderz you said you would add some extra comments, then I will review again :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2704159992 PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2704161929 From tschatzl at openjdk.org Thu Mar 6 15:39:57 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 6 Mar 2025 15:39:57 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: <4um7PHAs89PIoa3QgbkPx-8Jx9vHiYr7afFQGOtFTY8=.f1ca8bad-0827-4f8c-852d-0fc82ffd546a@github.com> On Tue, 4 Mar 2025 15:33:29 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming > > src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219: > >> 217: // The young gen revising mechanism reads the predictor and the values set >> 218: // here. Avoid inconsistencies by locking. >> 219: MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag); > > Who else can be in this critical-section? I don't get what this lock is protecting us from. Actually, further discussion with @albertnetymk showed that this change introduces an unintended behavioral change, because the refinement control thread is also responsible for updating the current young gen length. That means the mutex isn't required. However, it also means that this update is no longer done while refinement is running, and refinement can take seconds, so I need to move this work to another thread (probably the `G1ServiceThread`?). I will add a separate mutex then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983587293 From tschatzl at openjdk.org Thu Mar 6 16:13:02 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 6 Mar 2025 16:13:02 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v13] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 10:41:02 GMT, Ivan Walulya wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix whitespace >> * additional whitespace between log tags >> * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename > > src/hotspot/share/gc/g1/g1ThreadLocalData.hpp line 29: > >> 27: #include "gc/g1/g1BarrierSet.hpp" >> 28: #include "gc/g1/g1CardTable.hpp" >> 29: #include "gc/g1/g1CollectedHeap.hpp" > > probably does not need to be included `g1CardTable.hpp` is needed because of `G1CardTable::CardValue`, I think. I removed the `G1CollectedHeap` include though.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983655594 From tschatzl at openjdk.org Thu Mar 6 16:26:31 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 6 Mar 2025 16:26:31 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v14] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... 
Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * iwalulya review * renaming * fix some includes, forward declaration ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/a457e6e7..350a4fa3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=12-13 Stats: 31 lines in 13 files changed: 1 ins; 2 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From jbhateja at openjdk.org Thu Mar 6 16:38:57 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 6 Mar 2025 16:38:57 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 13:10:34 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added alignment to loop entries. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 2: > 1: /* > 2: * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved. Please update copyright year src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 96: > 94: StubRoutines::_dilithiumMontMulByConstant = generate_dilithiumMontMulByConstant_avx512(); > 95: StubRoutines::_dilithiumDecomposePoly = generate_dilithiumDecomposePoly_avx512(); > 96: } Indentation fix needed src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 362: > 360: const Register roundsLeft = r11; > 361: > 362: __ align(OptoLoopAlignment); Redundant alignment before label should be before it's bind ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983463096 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983464620 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983477681 From duke at openjdk.org Thu Mar 6 17:37:33 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 6 Mar 2025 17:37:33 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Accepted review comments. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/3aaa106f..64135f29 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=03-04 Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From galder at openjdk.org Fri Mar 7 06:19:03 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 06:19:03 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> > This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. > > Currently vectorization does not kick in for loops containing either of these calls because of the following error: > > > VLoop::check_preconditions: failed: control flow in loop not allowed > > > The control flow is due to the java implementation for these methods, e.g. > > > public static long max(long a, long b) { > return (a >= b) ? a : b; > } > > > This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. > By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. > E.g. > > > SuperWord::transform_loop: > Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined > 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) > > > Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1155 > long max 1173 > > > After the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1042 > long max 1042 > > > This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. > Therefore, it still relies on the macro expansion to transform those into CMoveL. > > I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier1 2500 2500 0 0 >>> jtreg:test/jdk:tier1 ... Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 47 additional commits since the last revision: - Merge branch 'master' into topic.intrinsify-max-min-long - Add assertion comments - Add simple reduction benchmarks on top of multiply ones - Merge branch 'master' into topic.intrinsify-max-min-long - Fix typo - Renaming methods and variables and add docu on algorithms - Fix copyright years - Make sure it runs with cpus with either avx512 or asimd - Test can only run with 256 bit registers or bigger * Remove platform dependant check and use platform independent configuration instead. - Fix license header - ... and 37 more: https://git.openjdk.org/jdk/compare/a328e466...1aa690d3 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20098/files - new: https://git.openjdk.org/jdk/pull/20098/files/d0e793a3..1aa690d3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=12-13 Stats: 65249 lines in 2144 files changed: 33401 ins; 21691 del; 10157 mod Patch: https://git.openjdk.org/jdk/pull/20098.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20098/head:pull/20098 PR: https://git.openjdk.org/jdk/pull/20098 From galder at openjdk.org Fri Mar 7 06:19:04 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 06:19:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v4] In-Reply-To: <9ReqLUCZ6XDaSQxgYw3NyZZdMv3SOHkCkzJ0DLAksas=.8cb29982-8cb8-4068-a251-59a189c83b93@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9ReqLUCZ6XDaSQxgYw3NyZZdMv3SOHkCkzJ0DLAksas=.8cb29982-8cb8-4068-a251-59a189c83b93@github.com> Message-ID: On Tue, 17 Dec 2024 16:40:01 GMT, Galder Zamarre?o wrote: >> test/hotspot/jtreg/compiler/intrinsics/math/TestMinMaxInlining.java line 80: >> >>> 78: @IR(phase = { CompilePhase.BEFORE_MACRO_EXPANSION }, counts = { IRNode.MIN_L, "1" }) >>> 79: @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.MIN_L, "0" }) >>> 80: private static long testLongMin(long a, long b) { >> >> Can you add a comment why it disappears after macro expansion? > > ~Good question. On non-avx512 machines after macro expansion the min/max nodes become cmov nodes, but but that's not the full story because on avx512 machines, they become minV/maxV nodes. Would you tweak the `@IR` annotations to capture this? Or would you leave it just as a comment?~ > > Scratch that, this is not a test for arrays, so no minV/maxV nodes. I'll just add a comment. I've added a comment ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20098#discussion_r1984510490 From galder at openjdk.org Fri Mar 7 06:19:04 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 06:19:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Thu, 6 Mar 2025 15:22:18 GMT, Emanuel Peter wrote: >> Also, I've started a [discussion on jmh-dev](https://mail.openjdk.org/pipermail/jmh-dev/2025-February/004094.html) to see if there's a way to minimise pollution of `Math.min(II)` compilation. As a follow to https://github.com/openjdk/jdk/pull/20098#issuecomment-2684701935 I looked at where the other `Math.min(II)` calls are coming from, and a big chunk seem related to the JMH infrastructure. 
> > @galderz you said you would add some extra comments, then I will review again :) @eme64 I've added the comment that was pending from your last review. I've also merged latest master. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2705620662 From epeter at openjdk.org Fri Mar 7 06:48:05 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Mar 2025 06:48:05 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:19:03 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 47 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Add assertion comments > - Add simple reduction benchmarks on top of multiply ones > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - ... and 37 more: https://git.openjdk.org/jdk/compare/99572e4c...1aa690d3 Looks good, thanks for all the updates :) I'm launching another round of testing on our side ;) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20098#pullrequestreview-2666394529 PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2705659841 From galder at openjdk.org Fri Mar 7 09:23:06 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 09:23:06 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:44:57 GMT, Emanuel Peter wrote: >> Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 47 additional commits since the last revision: >> >> - Merge branch 'master' into topic.intrinsify-max-min-long >> - Add assertion comments >> - Add simple reduction benchmarks on top of multiply ones >> - Merge branch 'master' into topic.intrinsify-max-min-long >> - Fix typo >> - Renaming methods and variables and add docu on algorithms >> - Fix copyright years >> - Make sure it runs with cpus with either avx512 or asimd >> - Test can only run with 256 bit registers or bigger >> >> * Remove platform dependant check >> and use platform independent configuration instead. >> - Fix license header >> - ... and 37 more: https://git.openjdk.org/jdk/compare/bc67ede6...1aa690d3 > > I'm launching another round of testing on our side ;) @eme64 I've run tier[1-3] locally and looked good overall. I had to update jtreg and noticed this failure but I don't think it's related to this PR: java.lang.AssertionError: gtest execution failed; exit code = 2. 
the failed tests: [codestrings::validate_vm] at GTestWrapper.main(GTestWrapper.java:98) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) at java.base/java.lang.Thread.run(Thread.java:1447) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2705937075 From galder at openjdk.org Fri Mar 7 12:28:58 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 12:28:58 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> Message-ID: On Thu, 27 Feb 2025 06:54:30 GMT, Emanuel Peter wrote: > As for possible solutions. In all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFE's: > > * Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue. I've created [JDK-8351409](https://bugs.openjdk.org/browse/JDK-8351409) to address this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2706324225 From ayang at openjdk.org Fri Mar 7 13:16:59 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Fri, 7 Mar 2025 13:16:59 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v14] In-Reply-To: References: Message-ID: <5w6qUwzDQadxseocRl6rRF0AllyeukWTpYl2XjAfiTE=.fb62a50e-e308-4d08-8057-67e70e13ccbb@github.com> On Thu, 6 Mar 2025 16:26:31 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * iwalulya review > * renaming > * fix some includes, forward declaration src/hotspot/share/gc/g1/g1CardTable.hpp line 76: > 74: g1_card_already_scanned = 0x1, > 75: g1_to_cset_card = 0x2, > 76: g1_from_remset_card = 0x4 Could you outline the motivation for this more precise info? Is it for optimization or essentially for correctness? src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 54: > 52: assert(refinement_r == card_r, "not same region source %u (%zu) dest %u (%zu) ", refinement_r->hrm_index(), refinement_i, card_r->hrm_index(), card_i); > 53: assert(refinement_i == card_i, "indexes are not same %zu %zu", refinement_i, card_i); > 54: #endif I feel this assert logic can be extracted to a method, sth like `verify_card_pair`. src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 64: > 62: report_inactive("Paused"); > 63: sts_join.yield(); > 64: // Reset after yield rather than accumulating across yields, else a The comment seems obsolete after the removal of stats. src/hotspot/share/gc/g1/g1OopClosures.inline.hpp line 158: > 156: if (_has_ref_to_cset) { > 157: return; > 158: } Is it really necessary to write `false` to `_has_ref_to_cset`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1985041202 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983846649 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983842440 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983857348 From epeter at openjdk.org Fri Mar 7 13:19:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Mar 2025 13:19:59 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> Message-ID: On Fri, 7 Mar 2025 12:25:51 GMT, Galder Zamarre?o wrote: >> @galderz Thanks for the summary of regressions! 
Yes, there are plenty of speedups, I assume primarily because of `Long.min/max` vectorization, but possibly also because the operation can now "float" out of a loop for example. >> >> All your Regressions 1-3 are cases with "extreme" probability (close to 100% / 0%), and you listed no others. That matches my intuition that branching code is usually better than cmove in extreme probability cases. >> >> As for possible solutions. In all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFE's: >> - Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue. >> - Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional. >> - Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional. >> >> Does that make sense, or am I missing something? > >> As for possible solutions. In all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFE's: >> >> * Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue. > > I've created [JDK-8351409](https://bugs.openjdk.org/browse/JDK-8351409) to address this. @galderz Excellent. Testing looks all good on our side. Yes, I think what you saw was unrelated. @rwestrel Could you give this a last quick scan, and then I think you can integrate :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2706434983 From duke at openjdk.org Fri Mar 7 14:22:29 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 7 Mar 2025 14:22:29 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code Message-ID: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> `Math.*Exact` intrinsics can cause many deopts when used repeatedly with problematic arguments. This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. Benchmarks show that this issue affects every Math.*Exact function, and this fix improves them all. tl;dr: - C1: no problem, no change - C2: - with intrinsics: - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) - without overflow: no problem, no change - without intrinsics: no problem, no change Before the fix: Benchmark (SIZE) Mode Cnt Score Error Units MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ?
0.810 ms/op MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.590 ms/op MathExact.C1_1.loopNegateLOverflow 1000000 avgt 3 638.837 ? 49.512 ms/op MathExact.C1_1.loopSubtractIInBounds 1000000 avgt 3 1.255 ? 0.799 ms/op MathExact.C1_1.loopSubtractIOverflow 1000000 avgt 3 637.857 ? 231.804 ms/op MathExact.C1_1.loopSubtractLInBounds 1000000 avgt 3 1.412 ? 0.602 ms/op MathExact.C1_1.loopSubtractLOverflow 1000000 avgt 3 642.113 ? 251.349 ms/op MathExact.C1_2.loopAddIInBounds 1000000 avgt 3 1.748 ? 1.095 ms/op MathExact.C1_2.loopAddIOverflow 1000000 avgt 3 654.617 ? 287.678 ms/op MathExact.C1_2.loopAddLInBounds 1000000 avgt 3 2.004 ? 1.655 ms/op MathExact.C1_2.loopAddLOverflow 1000000 avgt 3 670.791 ? 93.689 ms/op MathExact.C1_2.loopDecrementIInBounds 1000000 avgt 3 5.306 ? 65.215 ms/op MathExact.C1_2.loopDecrementIOverflow 1000000 avgt 3 650.425 ? 461.740 ms/op MathExact.C1_2.loopDecrementLInBounds 1000000 avgt 3 5.484 ? 42.778 ms/op MathExact.C1_2.loopDecrementLOverflow 1000000 avgt 3 656.747 ? 333.281 ms/op MathExact.C1_2.loopIncrementIInBounds 1000000 avgt 3 3.077 ? 1.677 ms/op MathExact.C1_2.loopIncrementIOverflow 1000000 avgt 3 634.510 ? 51.365 ms/op MathExact.C1_2.loopIncrementLInBounds 1000000 avgt 3 3.902 ? 18.471 ms/op MathExact.C1_2.loopIncrementLOverflow 1000000 avgt 3 656.465 ? 227.014 ms/op MathExact.C1_2.loopMultiplyIInBounds 1000000 avgt 3 2.384 ? 10.045 ms/op MathExact.C1_2.loopMultiplyIOverflow 1000000 avgt 3 624.029 ? 342.084 ms/op MathExact.C1_2.loopMultiplyLInBounds 1000000 avgt 3 3.247 ? 0.735 ms/op MathExact.C1_2.loopMultiplyLOverflow 1000000 avgt 3 661.427 ? 100.744 ms/op MathExact.C1_2.loopNegateIInBounds 1000000 avgt 3 3.061 ? 1.148 ms/op MathExact.C1_2.loopNegateIOverflow 1000000 avgt 3 645.241 ? 323.824 ms/op MathExact.C1_2.loopNegateLInBounds 1000000 avgt 3 3.211 ? 0.068 ms/op MathExact.C1_2.loopNegateLOverflow 1000000 avgt 3 658.846 ? 204.524 ms/op MathExact.C1_2.loopSubtractIInBounds 1000000 avgt 3 1.717 ? 0.161 ms/op MathExact.C1_2.loopSubtractIOverflow 1000000 avgt 3 644.287 ? 301.787 ms/op MathExact.C1_2.loopSubtractLInBounds 1000000 avgt 3 3.976 ? 11.982 ms/op MathExact.C1_2.loopSubtractLOverflow 1000000 avgt 3 660.871 ? 16.538 ms/op MathExact.C1_3.loopAddIInBounds 1000000 avgt 3 4.380 ? 42.598 ms/op MathExact.C1_3.loopAddIOverflow 1000000 avgt 3 686.766 ? 511.146 ms/op MathExact.C1_3.loopAddLInBounds 1000000 avgt 3 5.445 ? 49.738 ms/op MathExact.C1_3.loopAddLOverflow 1000000 avgt 3 641.936 ? 32.769 ms/op MathExact.C1_3.loopDecrementIInBounds 1000000 avgt 3 8.340 ? 69.455 ms/op MathExact.C1_3.loopDecrementIOverflow 1000000 avgt 3 682.239 ? 212.017 ms/op MathExact.C1_3.loopDecrementLInBounds 1000000 avgt 3 6.048 ? 
0.651 ms/op MathExact.C1_3.loopDecrementLOverflow 1000000 avgt 3 670.924 ? 42.037 ms/op MathExact.C1_3.loopIncrementIInBounds 1000000 avgt 3 7.970 ? 63.664 ms/op MathExact.C1_3.loopIncrementIOverflow 1000000 avgt 3 684.490 ? 197.407 ms/op MathExact.C1_3.loopIncrementLInBounds 1000000 avgt 3 8.780 ? 86.737 ms/op MathExact.C1_3.loopIncrementLOverflow 1000000 avgt 3 660.941 ? 172.305 ms/op MathExact.C1_3.loopMultiplyIInBounds 1000000 avgt 3 3.241 ? 0.567 ms/op MathExact.C1_3.loopMultiplyIOverflow 1000000 avgt 3 630.455 ? 138.458 ms/op MathExact.C1_3.loopMultiplyLInBounds 1000000 avgt 3 5.906 ? 0.662 ms/op MathExact.C1_3.loopMultiplyLOverflow 1000000 avgt 3 693.248 ? 539.146 ms/op MathExact.C1_3.loopNegateIInBounds 1000000 avgt 3 6.394 ? 7.757 ms/op MathExact.C1_3.loopNegateIOverflow 1000000 avgt 3 644.722 ? 56.929 ms/op MathExact.C1_3.loopNegateLInBounds 1000000 avgt 3 7.610 ? 41.533 ms/op MathExact.C1_3.loopNegateLOverflow 1000000 avgt 3 670.166 ? 14.496 ms/op MathExact.C1_3.loopSubtractIInBounds 1000000 avgt 3 3.345 ? 1.977 ms/op MathExact.C1_3.loopSubtractIOverflow 1000000 avgt 3 677.317 ? 22.878 ms/op MathExact.C1_3.loopSubtractLInBounds 1000000 avgt 3 3.226 ? 0.122 ms/op MathExact.C1_3.loopSubtractLOverflow 1000000 avgt 3 643.642 ? 65.217 ms/op MathExact.C2.loopAddIInBounds 1000000 avgt 3 1.217 ? 1.694 ms/op MathExact.C2.loopAddIOverflow 1000000 avgt 3 3995.424 ? 1177.165 ms/op MathExact.C2.loopAddLInBounds 1000000 avgt 3 2.404 ? 0.053 ms/op MathExact.C2.loopAddLOverflow 1000000 avgt 3 3997.984 ? 612.558 ms/op MathExact.C2.loopDecrementIInBounds 1000000 avgt 3 2.014 ? 0.176 ms/op MathExact.C2.loopDecrementIOverflow 1000000 avgt 3 3828.615 ? 260.670 ms/op MathExact.C2.loopDecrementLInBounds 1000000 avgt 3 1.986 ? 1.536 ms/op MathExact.C2.loopDecrementLOverflow 1000000 avgt 3 4075.934 ? 263.798 ms/op MathExact.C2.loopIncrementIInBounds 1000000 avgt 3 2.238 ? 6.380 ms/op MathExact.C2.loopIncrementIOverflow 1000000 avgt 3 3927.929 ? 837.162 ms/op MathExact.C2.loopIncrementLInBounds 1000000 avgt 3 1.971 ? 1.232 ms/op MathExact.C2.loopIncrementLOverflow 1000000 avgt 3 3915.202 ? 1024.956 ms/op MathExact.C2.loopMultiplyIInBounds 1000000 avgt 3 1.175 ? 0.509 ms/op MathExact.C2.loopMultiplyIOverflow 1000000 avgt 3 3803.719 ? 1583.828 ms/op MathExact.C2.loopMultiplyLInBounds 1000000 avgt 3 0.937 ? 0.631 ms/op MathExact.C2.loopMultiplyLOverflow 1000000 avgt 3 4023.742 ? 967.498 ms/op MathExact.C2.loopNegateIInBounds 1000000 avgt 3 2.129 ? 1.094 ms/op MathExact.C2.loopNegateIOverflow 1000000 avgt 3 3850.484 ? 464.979 ms/op MathExact.C2.loopNegateLInBounds 1000000 avgt 3 2.247 ? 9.714 ms/op MathExact.C2.loopNegateLOverflow 1000000 avgt 3 3911.853 ? 362.961 ms/op MathExact.C2.loopSubtractIInBounds 1000000 avgt 3 1.141 ? 1.579 ms/op MathExact.C2.loopSubtractIOverflow 1000000 avgt 3 3917.533 ? 628.485 ms/op MathExact.C2.loopSubtractLInBounds 1000000 avgt 3 2.232 ? 22.329 ms/op MathExact.C2.loopSubtractLOverflow 1000000 avgt 3 3995.088 ? 302.549 ms/op MathExact.C2_no_intrinsics.loopAddIInBounds 1000000 avgt 3 1.488 ? 12.243 ms/op MathExact.C2_no_intrinsics.loopAddIOverflow 1000000 avgt 3 585.568 ? 106.360 ms/op MathExact.C2_no_intrinsics.loopAddLInBounds 1000000 avgt 3 2.234 ? 23.010 ms/op MathExact.C2_no_intrinsics.loopAddLOverflow 1000000 avgt 3 602.290 ? 212.146 ms/op MathExact.C2_no_intrinsics.loopDecrementIInBounds 1000000 avgt 3 4.705 ? 36.814 ms/op MathExact.C2_no_intrinsics.loopDecrementIOverflow 1000000 avgt 3 590.212 ? 
280.334 ms/op
MathExact.C2_no_intrinsics.loopDecrementLInBounds 1000000 avgt 3 2.374 ? 13.667 ms/op
MathExact.C2_no_intrinsics.loopDecrementLOverflow 1000000 avgt 3 583.053 ? 50.535 ms/op
MathExact.C2_no_intrinsics.loopIncrementIInBounds 1000000 avgt 3 3.966 ? 15.366 ms/op
MathExact.C2_no_intrinsics.loopIncrementIOverflow 1000000 avgt 3 591.683 ? 171.580 ms/op
MathExact.C2_no_intrinsics.loopIncrementLInBounds 1000000 avgt 3 3.682 ? 23.147 ms/op
MathExact.C2_no_intrinsics.loopIncrementLOverflow 1000000 avgt 3 601.325 ? 10.597 ms/op
MathExact.C2_no_intrinsics.loopMultiplyIInBounds 1000000 avgt 3 1.307 ? 0.235 ms/op
MathExact.C2_no_intrinsics.loopMultiplyIOverflow 1000000 avgt 3 570.615 ? 50.808 ms/op
MathExact.C2_no_intrinsics.loopMultiplyLInBounds 1000000 avgt 3 1.087 ? 0.486 ms/op
MathExact.C2_no_intrinsics.loopMultiplyLOverflow 1000000 avgt 3 595.713 ? 162.773 ms/op
MathExact.C2_no_intrinsics.loopNegateIInBounds 1000000 avgt 3 1.874 ? 0.954 ms/op
MathExact.C2_no_intrinsics.loopNegateIOverflow 1000000 avgt 3 596.588 ? 68.081 ms/op
MathExact.C2_no_intrinsics.loopNegateLInBounds 1000000 avgt 3 2.337 ? 12.164 ms/op
MathExact.C2_no_intrinsics.loopNegateLOverflow 1000000 avgt 3 573.711 ? 63.243 ms/op
MathExact.C2_no_intrinsics.loopSubtractIInBounds 1000000 avgt 3 1.085 ? 0.815 ms/op
MathExact.C2_no_intrinsics.loopSubtractIOverflow 1000000 avgt 3 579.489 ? 61.399 ms/op
MathExact.C2_no_intrinsics.loopSubtractLInBounds 1000000 avgt 3 1.020 ? 0.161 ms/op
MathExact.C2_no_intrinsics.loopSubtractLOverflow 1000000 avgt 3 580.578 ? 167.454 ms/op

After:

Benchmark (SIZE) Mode Cnt Score Error Units
MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.369 ? 0.462 ms/op
MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 635.020 ? 106.156 ms/op
MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.371 ? 0.020 ms/op
MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 633.864 ? 72.176 ms/op
MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 2.053 ? 0.330 ms/op
MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 634.675 ? 79.427 ms/op
MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 3.798 ? 38.502 ms/op
MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 650.880 ? 123.220 ms/op
MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 2.305 ? 4.829 ms/op
MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 648.231 ? 39.012 ms/op
MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.627 ? 3.129 ms/op
MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 663.671 ? 446.140 ms/op
MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.479 ? 0.102 ms/op
MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 627.959 ? 297.291 ms/op
MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.718 ? 0.806 ms/op
MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.310 ? 112.686 ms/op
MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.079 ? 2.166 ms/op
MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 640.530 ? 152.489 ms/op
MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 3.168 ? 16.524 ms/op
MathExact.C1_1.loopNegateLOverflow 1000000 avgt 3 650.823 ? 58.420 ms/op
MathExact.C1_1.loopSubtractIInBounds 1000000 avgt 3 2.325 ? 27.865 ms/op
MathExact.C1_1.loopSubtractIOverflow 1000000 avgt 3 632.198 ? 280.799 ms/op
MathExact.C1_1.loopSubtractLInBounds 1000000 avgt 3 1.478 ? 0.281 ms/op
MathExact.C1_1.loopSubtractLOverflow 1000000 avgt 3 626.481 ? 47.028 ms/op
MathExact.C1_2.loopAddIInBounds 1000000 avgt 3 1.850 ? 0.462 ms/op
MathExact.C1_2.loopAddIOverflow 1000000 avgt 3 640.668 ? 217.610 ms/op
MathExact.C1_2.loopAddLInBounds 1000000 avgt 3 1.823 ? 0.123 ms/op
MathExact.C1_2.loopAddLOverflow 1000000 avgt 3 643.123 ? 174.505 ms/op
MathExact.C1_2.loopDecrementIInBounds 1000000 avgt 3 6.435 ? 54.316 ms/op
MathExact.C1_2.loopDecrementIOverflow 1000000 avgt 3 649.622 ? 15.314 ms/op
MathExact.C1_2.loopDecrementLInBounds 1000000 avgt 3 4.315 ? 26.421 ms/op
MathExact.C1_2.loopDecrementLOverflow 1000000 avgt 3 649.018 ? 386.320 ms/op
MathExact.C1_2.loopIncrementIInBounds 1000000 avgt 3 3.444 ? 1.375 ms/op
MathExact.C1_2.loopIncrementIOverflow 1000000 avgt 3 628.711 ? 51.292 ms/op
MathExact.C1_2.loopIncrementLInBounds 1000000 avgt 3 3.351 ? 0.483 ms/op
MathExact.C1_2.loopIncrementLOverflow 1000000 avgt 3 653.560 ? 160.718 ms/op
MathExact.C1_2.loopMultiplyIInBounds 1000000 avgt 3 1.860 ? 0.633 ms/op
MathExact.C1_2.loopMultiplyIOverflow 1000000 avgt 3 620.883 ? 54.516 ms/op
MathExact.C1_2.loopMultiplyLInBounds 1000000 avgt 3 3.998 ? 16.269 ms/op
MathExact.C1_2.loopMultiplyLOverflow 1000000 avgt 3 671.956 ? 93.092 ms/op
MathExact.C1_2.loopNegateIInBounds 1000000 avgt 3 4.415 ? 44.105 ms/op
MathExact.C1_2.loopNegateIOverflow 1000000 avgt 3 661.902 ? 224.843 ms/op
MathExact.C1_2.loopNegateLInBounds 1000000 avgt 3 3.492 ? 0.738 ms/op
MathExact.C1_2.loopNegateLOverflow 1000000 avgt 3 634.946 ? 150.491 ms/op
MathExact.C1_2.loopSubtractIInBounds 1000000 avgt 3 1.712 ? 0.066 ms/op
MathExact.C1_2.loopSubtractIOverflow 1000000 avgt 3 651.508 ? 76.022 ms/op
MathExact.C1_2.loopSubtractLInBounds 1000000 avgt 3 1.949 ? 0.201 ms/op
MathExact.C1_2.loopSubtractLOverflow 1000000 avgt 3 627.459 ? 26.817 ms/op
MathExact.C1_3.loopAddIInBounds 1000000 avgt 3 7.378 ? 4.301 ms/op
MathExact.C1_3.loopAddIOverflow 1000000 avgt 3 647.275 ? 177.062 ms/op
MathExact.C1_3.loopAddLInBounds 1000000 avgt 3 3.427 ? 0.037 ms/op
MathExact.C1_3.loopAddLOverflow 1000000 avgt 3 643.735 ? 227.934 ms/op
MathExact.C1_3.loopDecrementIInBounds 1000000 avgt 3 5.680 ? 0.497 ms/op
MathExact.C1_3.loopDecrementIOverflow 1000000 avgt 3 666.431 ? 8.006 ms/op
MathExact.C1_3.loopDecrementLInBounds 1000000 avgt 3 6.897 ? 24.615 ms/op
MathExact.C1_3.loopDecrementLOverflow 1000000 avgt 3 683.691 ? 52.892 ms/op
MathExact.C1_3.loopIncrementIInBounds 1000000 avgt 3 5.743 ? 0.602 ms/op
MathExact.C1_3.loopIncrementIOverflow 1000000 avgt 3 670.027 ? 175.208 ms/op
MathExact.C1_3.loopIncrementLInBounds 1000000 avgt 3 6.157 ? 2.876 ms/op
MathExact.C1_3.loopIncrementLOverflow 1000000 avgt 3 673.410 ? 245.939 ms/op
MathExact.C1_3.loopMultiplyIInBounds 1000000 avgt 3 3.220 ? 0.165 ms/op
MathExact.C1_3.loopMultiplyIOverflow 1000000 avgt 3 640.165 ? 505.006 ms/op
MathExact.C1_3.loopMultiplyLInBounds 1000000 avgt 3 7.986 ? 62.547 ms/op
MathExact.C1_3.loopMultiplyLOverflow 1000000 avgt 3 681.282 ? 107.856 ms/op
MathExact.C1_3.loopNegateIInBounds 1000000 avgt 3 7.133 ? 18.111 ms/op
MathExact.C1_3.loopNegateIOverflow 1000000 avgt 3 680.976 ? 285.486 ms/op
MathExact.C1_3.loopNegateLInBounds 1000000 avgt 3 7.405 ? 37.040 ms/op
MathExact.C1_3.loopNegateLOverflow 1000000 avgt 3 681.574 ? 173.484 ms/op
MathExact.C1_3.loopSubtractIInBounds 1000000 avgt 3 3.971 ? 16.942 ms/op
MathExact.C1_3.loopSubtractIOverflow 1000000 avgt 3 655.780 ? 230.793 ms/op
MathExact.C1_3.loopSubtractLInBounds 1000000 avgt 3 3.369 ? 3.844 ms/op
MathExact.C1_3.loopSubtractLOverflow 1000000 avgt 3 634.824 ? 20.350 ms/op
MathExact.C2.loopAddIInBounds 1000000 avgt 3 2.461 ? 2.936 ms/op
MathExact.C2.loopAddIOverflow 1000000 avgt 3 589.095 ? 151.126 ms/op
MathExact.C2.loopAddLInBounds 1000000 avgt 3 0.978 ? 0.604 ms/op
MathExact.C2.loopAddLOverflow 1000000 avgt 3 590.511 ? 64.618 ms/op
MathExact.C2.loopDecrementIInBounds 1000000 avgt 3 1.981 ? 0.443 ms/op
MathExact.C2.loopDecrementIOverflow 1000000 avgt 3 593.578 ? 32.752 ms/op
MathExact.C2.loopDecrementLInBounds 1000000 avgt 3 2.924 ? 29.455 ms/op
MathExact.C2.loopDecrementLOverflow 1000000 avgt 3 601.392 ? 936.568 ms/op
MathExact.C2.loopIncrementIInBounds 1000000 avgt 3 2.697 ? 22.142 ms/op
MathExact.C2.loopIncrementIOverflow 1000000 avgt 3 602.418 ? 199.763 ms/op
MathExact.C2.loopIncrementLInBounds 1000000 avgt 3 1.954 ? 0.396 ms/op
MathExact.C2.loopIncrementLOverflow 1000000 avgt 3 601.183 ? 156.439 ms/op
MathExact.C2.loopMultiplyIInBounds 1000000 avgt 3 1.530 ? 7.954 ms/op
MathExact.C2.loopMultiplyIOverflow 1000000 avgt 3 566.677 ? 45.992 ms/op
MathExact.C2.loopMultiplyLInBounds 1000000 avgt 3 2.184 ? 22.242 ms/op
MathExact.C2.loopMultiplyLOverflow 1000000 avgt 3 600.233 ? 234.648 ms/op
MathExact.C2.loopNegateIInBounds 1000000 avgt 3 2.130 ? 1.028 ms/op
MathExact.C2.loopNegateIOverflow 1000000 avgt 3 593.145 ? 337.886 ms/op
MathExact.C2.loopNegateLInBounds 1000000 avgt 3 2.600 ? 20.795 ms/op
MathExact.C2.loopNegateLOverflow 1000000 avgt 3 592.288 ? 138.321 ms/op
MathExact.C2.loopSubtractIInBounds 1000000 avgt 3 1.081 ? 0.265 ms/op
MathExact.C2.loopSubtractIOverflow 1000000 avgt 3 575.884 ? 200.113 ms/op
MathExact.C2.loopSubtractLInBounds 1000000 avgt 3 1.016 ? 0.792 ms/op
MathExact.C2.loopSubtractLOverflow 1000000 avgt 3 589.873 ? 52.521 ms/op
MathExact.C2_no_intrinsics.loopAddIInBounds 1000000 avgt 3 2.166 ? 10.999 ms/op
MathExact.C2_no_intrinsics.loopAddIOverflow 1000000 avgt 3 586.660 ? 229.451 ms/op
MathExact.C2_no_intrinsics.loopAddLInBounds 1000000 avgt 3 1.054 ? 0.528 ms/op
MathExact.C2_no_intrinsics.loopAddLOverflow 1000000 avgt 3 572.511 ? 76.440 ms/op
MathExact.C2_no_intrinsics.loopDecrementIInBounds 1000000 avgt 3 1.907 ? 0.149 ms/op
MathExact.C2_no_intrinsics.loopDecrementIOverflow 1000000 avgt 3 599.262 ? 600.992 ms/op
MathExact.C2_no_intrinsics.loopDecrementLInBounds 1000000 avgt 3 1.820 ? 0.106 ms/op
MathExact.C2_no_intrinsics.loopDecrementLOverflow 1000000 avgt 3 570.464 ? 44.418 ms/op
MathExact.C2_no_intrinsics.loopIncrementIInBounds 1000000 avgt 3 1.914 ? 0.131 ms/op
MathExact.C2_no_intrinsics.loopIncrementIOverflow 1000000 avgt 3 575.143 ? 160.185 ms/op
MathExact.C2_no_intrinsics.loopIncrementLInBounds 1000000 avgt 3 1.818 ? 0.288 ms/op
MathExact.C2_no_intrinsics.loopIncrementLOverflow 1000000 avgt 3 589.998 ? 33.029 ms/op
MathExact.C2_no_intrinsics.loopMultiplyIInBounds 1000000 avgt 3 1.960 ? 10.135 ms/op
MathExact.C2_no_intrinsics.loopMultiplyIOverflow 1000000 avgt 3 571.497 ? 264.484 ms/op
MathExact.C2_no_intrinsics.loopMultiplyLInBounds 1000000 avgt 3 1.061 ? 0.198 ms/op
MathExact.C2_no_intrinsics.loopMultiplyLOverflow 1000000 avgt 3 585.139 ? 317.175 ms/op
MathExact.C2_no_intrinsics.loopNegateIInBounds 1000000 avgt 3 2.611 ? 22.325 ms/op
MathExact.C2_no_intrinsics.loopNegateIOverflow 1000000 avgt 3 579.911 ? 140.426 ms/op
MathExact.C2_no_intrinsics.loopNegateLInBounds 1000000 avgt 3 2.233 ? 2.774 ms/op
MathExact.C2_no_intrinsics.loopNegateLOverflow 1000000 avgt 3 572.368 ? 81.851 ms/op
MathExact.C2_no_intrinsics.loopSubtractIInBounds 1000000 avgt 3 3.162 ? 38.115 ms/op
MathExact.C2_no_intrinsics.loopSubtractIOverflow 1000000 avgt 3 582.794 ? 65.622 ms/op
MathExact.C2_no_intrinsics.loopSubtractLInBounds 1000000 avgt 3 1.028 ? 0.255 ms/op
MathExact.C2_no_intrinsics.loopSubtractLOverflow 1000000 avgt 3 577.491 ? 69.778 ms/op

Is it worth having intrinsics at all? @eme64 wondered, so I tried with this code:

    public class Test {
        final static int N = 500_000_000;

        public static int test(int i) {
            try {
                return Math.multiplyExact(i, i);
            } catch (Throwable e) {
                return 0;
            }
        }

        public static void loop() {
            for (int i = 0; i < N; i++) {
                test(i % 32_768);
            }
        }

        public static void main(String[] args) {
            loop();
        }
    }

and with many more runs (50 instead of 3), under a more stable load from the rest of the system.

No intrinsic (inlined Java implementation):

    Benchmark 1: ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement Test.java
      Time (mean ? ?): 8.651 s ? 0.902 s [User: 8.517 s, System: 0.155 s]
      Range (min ? max): 6.853 s ? 10.439 s    50 runs

Always intrinsic (current behavior, and the new behavior in the absence of overflow, as in this example):

    Benchmark 1: ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement Test.java
      Time (mean ? ?): 8.222 s ? 1.024 s [User: 8.090 s, System: 0.155 s]
      Range (min ? max): 6.667 s ? 10.406 s    50 runs

So it's... not very conclusive, but likely to be a bit useful. The gap between the means is about 0.4s, which is less than half the standard deviation. Still, it seems good to have.

From a more theoretical point of view, we can see that the code generated for the intrinsic is mostly a `mul` and a `jo`, while it is much more complicated for the inlined Java implementation (with many `mov`, `movsx`, `cmp` and conditional jumps, looking a lot like the Java code).

Thanks,
Marc

------------- Commit messages: - More exhaustive bench - Limit inlining of math Exact operations in case of too many deopts Changes: https://git.openjdk.org/jdk/pull/23916/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346989 Stats: 405 lines in 2 files changed: 404 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23916.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916 PR: https://git.openjdk.org/jdk/pull/23916

From epeter at openjdk.org Fri Mar 7 14:22:30 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Mar 2025 14:22:30 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: 

On Wed, 5 Mar 2025 12:56:48 GMT, Marc Chevalier wrote:

> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments.
> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached.
>
> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all.
>
> tl;dr:
> - C1: no problem, no change
> - C2:
> - with intrinsics:
> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms)
> - without overflow: no problem, no change
> - without intrinsics: no problem, no change
>
> Before the fix:
>
> Benchmark (SIZE) Mode Cnt Score Error Units
> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op
> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op
> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ?
0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... The benchmark generally looks good to me, I only have some minor suggestions ;) Ah. And is this only about `multiplyExact`, or are there other methods affected? Would be nice to extend the benchmark to those as well. And yet another idea: you could probably write an IR test that checks that we at first have the compilation with the trap, and another test where we trap too much and then get a different compilation (without the intrinsic?). Plus: the issue title is very generic. I think it should mention something about `Math.*Exact` as well ;) test/micro/org/openjdk/bench/vm/compiler/MultiplyExact.java line 47: > 45: try { > 46: return square(i); > 47: } catch (Throwable e) { Can you catch a more specific exception? Catching very general exceptions can often mask other bugs. I suppose this is only a benchmark, but it would still be good practice ;) test/micro/org/openjdk/bench/vm/compiler/MultiplyExact.java line 62: > 60: > 61: @Fork(value = 1) > 62: public static class C2 extends MultiplyExact {} What about a C2 version where you just disable the intrinsic? ------------- PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2663529726 PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2703023122 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1982809388 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1982808076 From epeter at openjdk.org Fri Mar 7 14:22:30 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Mar 2025 14:22:30 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Thu, 6 Mar 2025 07:16:40 GMT, Emanuel Peter wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. 
Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > The benchmark generally looks good to me, I only have some minor suggestions ;) > Is it worth inlining at all? @eme64 wondered, so I tried with this code: You ask this in the PR description. I think I was not thinking about `inlining` but rather using the `intrinsic`. How much speedup does the intrinsic really deliver? Is it really better than pure Java? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2703015476 From duke at openjdk.org Fri Mar 7 14:22:30 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 7 Mar 2025 14:22:30 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Thu, 6 Mar 2025 07:19:48 GMT, Emanuel Peter wrote: > You ask this in the PR description. I think I was not thinking about inlining but rather using the intrinsic. How much speedup does the intrinsic really deliver? Is it really better than pure Java? My fault. I used "inline" instead of "intrinsic" because the functions implementing the intrinsic are called `inline_math_mathExact` and alike. So, I compared the intrinsic vs. the pure java implementation, that happens to be inlined. And intrinsic is a bit better. I'll edit the text to fix that. 
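(For readers following along: the pure-Java fallback that gets inlined when the intrinsic is not used looks roughly like the sketch below, paraphrased from java.lang.Math for the int overload only; treat it as illustrative rather than the exact JDK source. The long overload is more involved, which is part of why its non-intrinsic code looks so much busier.)

    public static int multiplyExact(int x, int y) {
        long r = (long) x * (long) y;   // widen to 64 bits so the product cannot wrap
        if ((int) r != r) {
            // The intrinsic handles this case with an uncommon trap instead of an explicit throw.
            throw new ArithmeticException("integer overflow");
        }
        return (int) r;
    }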
------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2703823132 From duke at openjdk.org Fri Mar 7 14:22:30 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 7 Mar 2025 14:22:30 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: <7npMvWN2HNTIZOpeIVuhrZM9i5YiZEDvJC6xlReut_4=.e8a98a0b-7146-44a7-94e1-0d4a27566f1f@github.com> On Thu, 6 Mar 2025 07:11:40 GMT, Emanuel Peter wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > test/micro/org/openjdk/bench/vm/compiler/MultiplyExact.java line 47: > >> 45: try { >> 46: return square(i); >> 47: } catch (Throwable e) { > > Can you catch a more specific exception? Catching very general exceptions can often mask other bugs. I suppose this is only a benchmark, but it would still be good practice ;) Indeed. > test/micro/org/openjdk/bench/vm/compiler/MultiplyExact.java line 62: > >> 60: >> 61: @Fork(value = 1) >> 62: public static class C2 extends MultiplyExact {} > > What about a C2 version where you just disable the intrinsic? Good idea. Done. 
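(One plausible way to spell such a fork, assuming the `MathExact` base class used by this benchmark and the `UseMathExactIntrinsics` flag; the actual patch may configure it differently.)

    // Same C2 setup as the plain fork, but with the Math.*Exact intrinsics turned off.
    @Fork(value = 1, jvmArgsAppend = {"-XX:-UseMathExactIntrinsics"})
    public static class C2_no_intrinsics extends MathExact {}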
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1985004497 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1985003664 From vlivanov at openjdk.org Fri Mar 7 18:06:56 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 7 Mar 2025 18:06:56 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> On Wed, 5 Mar 2025 12:56:48 GMT, Marc Chevalier wrote: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... Nice benchmark, Marc! src/hotspot/share/opto/library_call.cpp line 1963: > 1961: set_i_o(i_o()); > 1962: > 1963: uncommon_trap(Deoptimization::Reason_intrinsic, What about using `builtin_throw` here? (Requires some tuning on `builtin_throw` side.) How much does it affect performance? Also, passing `must_throw = true` into `uncommon_trap` may help a bit here as well. 
------------- PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2667969834 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1985476888 From tschatzl at openjdk.org Sat Mar 8 19:32:54 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Sat, 8 Mar 2025 19:32:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v15] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. Cause are last-minute changes before making the PR ready to review. 
Testing: without the patch, occurs fairly frequently when continuously (1 in 20) starting refinement. Does not afterward. - * ayang review 3 * comments * minor refactorings ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/350a4fa3..93b884f1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=13-14 Stats: 35 lines in 5 files changed: 30 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Sat Mar 8 19:32:54 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Sat, 8 Mar 2025 19:32:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 10:46:13 GMT, Thomas Schatzl wrote: > I got an error while testing java/foreign/TestUpcallStress.java on linuxaarch64 with this PR: Fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2708458459 From dnsimon at openjdk.org Sun Mar 9 19:12:34 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Sun, 9 Mar 2025 19:12:34 GMT Subject: RFR: 8346825: [JVMCI] Remove NativeImageReinitialize annotationremoved NativeImageReinitialize annotation Message-ID: The `jdk.vm.ci.common.NativeImageReinitialize` annotation was introduced to reset JVMCI and Graal fields to their default values as they are copied into the libgraal image. Now that class loader separation is used to isolate the JVMCI and Graal classes compiled to produce libgraal from the JVMCI and Graal classes being executed to do the AOT compilation, the need for this field resetting is no longer needed. This PR removes the `NativeImageReinitialize` annotation. ------------- Commit messages: - removed NativeImageReinitialize annotation Changes: https://git.openjdk.org/jdk/pull/23957/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23957&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346825 Stats: 69 lines in 10 files changed: 0 ins; 44 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/23957.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23957/head:pull/23957 PR: https://git.openjdk.org/jdk/pull/23957 From never at openjdk.org Mon Mar 10 02:46:03 2025 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 10 Mar 2025 02:46:03 GMT Subject: RFR: 8346825: [JVMCI] Remove NativeImageReinitialize annotationremoved NativeImageReinitialize annotation In-Reply-To: References: Message-ID: On Sun, 9 Mar 2025 19:07:54 GMT, Doug Simon wrote: > The `jdk.vm.ci.common.NativeImageReinitialize` annotation was introduced to reset JVMCI and Graal fields to their default values as they are copied into the libgraal image. Now that class loader separation is used to isolate the JVMCI and Graal classes compiled to produce libgraal from the JVMCI and Graal classes being executed to do the AOT compilation, the need for this field resetting is no longer needed. This PR removes the `NativeImageReinitialize` annotation. Marked as reviewed by never (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23957#pullrequestreview-2669672002 From lmesnik at openjdk.org Mon Mar 10 03:03:00 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Mon, 10 Mar 2025 03:03:00 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Thu, 6 Mar 2025 17:37:33 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Accepted review comments. There are no any new tests in the PR. How fix has been tested by openjdk tests? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2709309387 From roland at openjdk.org Mon Mar 10 09:02:15 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 10 Mar 2025 09:02:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:19:03 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. 
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 47 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Add assertion comments > - Add simple reduction benchmarks on top of multiply ones > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - ... and 37 more: https://git.openjdk.org/jdk/compare/07ef652d...1aa690d3 Marked as reviewed by roland (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/20098#pullrequestreview-2670211951 From chagedorn at openjdk.org Mon Mar 10 09:19:10 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 10 Mar 2025 09:19:10 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:19:03 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. 
>> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 47 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Add assertion comments > - Add simple reduction benchmarks on top of multiply ones > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - ... and 37 more: https://git.openjdk.org/jdk/compare/fd78e706...1aa690d3 Good work and collection of all the data! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20098#pullrequestreview-2670256931 From duke at openjdk.org Mon Mar 10 10:23:01 2025 From: duke at openjdk.org (Marc Chevalier) Date: Mon, 10 Mar 2025 10:23:01 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> Message-ID: On Fri, 7 Mar 2025 18:03:14 GMT, Vladimir Ivanov wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. 
>> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > src/hotspot/share/opto/library_call.cpp line 1963: > >> 1961: set_i_o(i_o()); >> 1962: >> 1963: uncommon_trap(Deoptimization::Reason_intrinsic, > > What about using `builtin_throw` here? (Requires some tuning on `builtin_throw` side.) How much does it affect performance? Also, passing `must_throw = true` into `uncommon_trap` may help a bit here as well. Using `builtin_throw` sounds nice! But indeed, it won't work so directly. I want to prevent intrinsic in case of `too_many_traps`. But that's only when `builtin_throw` will do something. But if I only rely on `builtin_throw`, then, when the built-in throwing is not possible (that is when `treat_throw_as_hot && method()->can_omit_stack_trace()` is false), we will have the repeated deopt again. There is also throwing the right exception, which is right now determined only by the reason (which adapts poorly to this case). I guess that's what you meant by tuning: be able to know if we would built-in throw, and if so, do it, otherwise, prevent infinitely repeated deopt. The way I see doing that is by (maybe optionally) providing the preallocated exception to throw as a parameter so that we don't have to rely on the "reason to exception" decision (or we can override it), and factor out the decision whether we can take the nice branch of `builtin_throw` so that we can bail out of intrinsic if we can't fast throw before we start setting up the intrinsic (that we would then need to undo). Does that match what you had in mind or you have another suggestion? 
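(For context, the current fix amounts to roughly the following sketch against LibraryCallKit; the function name is hypothetical and this is not the actual patch.)

    // Sketch only: bail out of the Math.*Exact intrinsic when this method already
    // trapped here repeatedly, so C2 compiles the Java fallback instead of
    // re-entering the deopt/recompile cycle.
    bool LibraryCallKit::inline_math_overflow_sketch(Node* arg1, Node* arg2) {
      if (too_many_traps(Deoptimization::Reason_intrinsic)) {
        return false;  // do not use the intrinsic; the bytecode is compiled normally
      }
      // ... build the overflow-checked node as before, ending in
      // uncommon_trap(Deoptimization::Reason_intrinsic, ...) on overflow.
      return true;
    }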
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1986999005 From dnsimon at openjdk.org Mon Mar 10 11:06:00 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 10 Mar 2025 11:06:00 GMT Subject: RFR: 8346825: [JVMCI] Remove NativeImageReinitialize annotation In-Reply-To: References: Message-ID: On Sun, 9 Mar 2025 19:07:54 GMT, Doug Simon wrote: > The `jdk.vm.ci.common.NativeImageReinitialize` annotation was introduced to reset JVMCI and Graal fields to their default values as they are copied into the libgraal image. Now that class loader separation is used to isolate the JVMCI and Graal classes compiled to produce libgraal from the JVMCI and Graal classes being executed to do the AOT compilation, the need for this field resetting is no longer needed. This PR removes the `NativeImageReinitialize` annotation. Thanks for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23957#issuecomment-2710203461 From dnsimon at openjdk.org Mon Mar 10 11:06:00 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 10 Mar 2025 11:06:00 GMT Subject: Integrated: 8346825: [JVMCI] Remove NativeImageReinitialize annotation In-Reply-To: References: Message-ID: On Sun, 9 Mar 2025 19:07:54 GMT, Doug Simon wrote: > The `jdk.vm.ci.common.NativeImageReinitialize` annotation was introduced to reset JVMCI and Graal fields to their default values as they are copied into the libgraal image. Now that class loader separation is used to isolate the JVMCI and Graal classes compiled to produce libgraal from the JVMCI and Graal classes being executed to do the AOT compilation, the need for this field resetting is no longer needed. This PR removes the `NativeImageReinitialize` annotation. This pull request has now been integrated. Changeset: 99547c5b Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/99547c5b254807580e0a5238b95d55d38181f4fc Stats: 69 lines in 10 files changed: 0 ins; 44 del; 25 mod 8346825: [JVMCI] Remove NativeImageReinitialize annotation Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/23957 From fyang at openjdk.org Tue Mar 11 03:25:55 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 11 Mar 2025 03:25:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v15] In-Reply-To: References: Message-ID: On Sat, 8 Mar 2025 19:32:54 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: > > - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. > Cause are last-minute changes before making the PR ready to review. > > Testing: without the patch, occurs fairly frequently when continuously > (1 in 20) starting refinement. Does not afterward. > - * ayang review 3 > * comments > * minor refactorings Tier1-3 test good on linux-riscv64 platform. And I have prepared an add-on change which implements the barrier method to write cards for a reference array for this platform. Do you want to have it in this PR? Thanks. [23739-riscv-addon.txt](https://github.com/user-attachments/files/19174898/23739-riscv-addon.txt) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2712469306 From tschatzl at openjdk.org Tue Mar 11 09:51:53 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 11 Mar 2025 09:51:53 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v16] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
> > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/93b884f1..758fac01 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=15 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=14-15 Stats: 36 lines in 1 file changed: 28 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Tue Mar 11 09:54:05 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 11 Mar 2025 09:54:05 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v15] In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 03:22:52 GMT, Fei Yang wrote: > Tier1-3 test good on linux-riscv64 platform. And I have prepared an add-on change which implements the barrier method to write cards for a reference array for this platform. Do you want to have it in this PR? Thanks. I added your changes, thank you! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2713415911

From shade at openjdk.org Tue Mar 11 11:41:28 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 11 Mar 2025 11:41:28 GMT Subject: RFR: 8351640: Print reason for making method not entrant Message-ID: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com>

A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out.

Sample log excerpt for mainline:

$ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log
987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes)
1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes)
1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used
4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap
5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes)
6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes)
6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used

You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused.

Additional testing:
 - [x] Linux x86_64 server fastdebug, `hotspot:tier1`

------------- Commit messages: - Use resource allocation for temp buffer - Base version Changes: https://git.openjdk.org/jdk/pull/23980/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351640 Stats: 36 lines in 14 files changed: 8 ins; 0 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/23980.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23980/head:pull/23980 PR: https://git.openjdk.org/jdk/pull/23980

From kvn at openjdk.org Tue Mar 11 17:58:54 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Mar 2025 17:58:54 GMT Subject: RFR: 8351640: Print reason for making method not entrant In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: 

On Tue, 11 Mar 2025 11:36:59 GMT, Aleksey Shipilev wrote:

> A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out.
> > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23980#pullrequestreview-2675594015 From vlivanov at openjdk.org Tue Mar 11 18:52:56 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 11 Mar 2025 18:52:56 GMT Subject: RFR: 8351640: Print reason for making method not entrant In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Tue, 11 Mar 2025 11:36:59 GMT, Aleksey Shipilev wrote: > A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. > > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. 
> > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` src/hotspot/share/code/nmethod.cpp line 1965: > 1963: if (LogCompilation) { > 1964: if (xtty != nullptr) { > 1965: ttyLocker ttyl; // keep the following output all in one block Please, include same info in `LogCompilation` log. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23980#discussion_r1989937760 From dnsimon at openjdk.org Tue Mar 11 19:41:18 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 11 Mar 2025 19:41:18 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null Message-ID: All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. ------------- Commit messages: - nmethod entry barriers are no longer optional Changes: https://git.openjdk.org/jdk/pull/23996/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351700 Stats: 171 lines in 27 files changed: 5 ins; 103 del; 63 mod Patch: https://git.openjdk.org/jdk/pull/23996.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23996/head:pull/23996 PR: https://git.openjdk.org/jdk/pull/23996 From eosterlund at openjdk.org Tue Mar 11 19:41:18 2025 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Tue, 11 Mar 2025 19:41:18 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 19:29:05 GMT, Doug Simon wrote: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. Nice! Looks good. ------------- Marked as reviewed by eosterlund (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23996#pullrequestreview-2675894137 From never at openjdk.org Tue Mar 11 19:53:00 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 11 Mar 2025 19:53:00 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 19:29:05 GMT, Doug Simon wrote: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 6549: > 6547: BarrierSetNMethod* bs_nm = BarrierSet::barrier_set()->barrier_set_nmethod(); > 6548: if (bs_nm != nullptr) { > 6549: StubRoutines::_method_entry_barrier = generate_method_entry_barrier(); Shouldn't you have kept this line? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23996#discussion_r1990025685 From dnsimon at openjdk.org Tue Mar 11 20:01:00 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 11 Mar 2025 20:01:00 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v2] In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 19:50:18 GMT, Tom Rodriguez wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> revived accidentally deleted code > > src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 6549: > >> 6547: BarrierSetNMethod* bs_nm = BarrierSet::barrier_set()->barrier_set_nmethod(); >> 6548: if (bs_nm != nullptr) { >> 6549: StubRoutines::_method_entry_barrier = generate_method_entry_barrier(); > > Shouldn't you have kept this line? Absolutely! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23996#discussion_r1990039724 From dnsimon at openjdk.org Tue Mar 11 20:00:59 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 11 Mar 2025 20:00:59 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v2] In-Reply-To: References: Message-ID: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. Doug Simon has updated the pull request incrementally with one additional commit since the last revision: revived accidentally deleted code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23996/files - new: https://git.openjdk.org/jdk/pull/23996/files/b958ee43..b3d4721d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23996.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23996/head:pull/23996 PR: https://git.openjdk.org/jdk/pull/23996 From never at openjdk.org Tue Mar 11 21:53:55 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 11 Mar 2025 21:53:55 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v2] In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 20:00:59 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > revived accidentally deleted code Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23996#pullrequestreview-2676195527 From fyang at openjdk.org Wed Mar 12 00:32:53 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 12 Mar 2025 00:32:53 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v2] In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 20:00:59 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > revived accidentally deleted code src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 9903: > 9901: generate_arraycopy_stubs(); > 9902: > 9903: BarrierSetNMethod* bs_nm = BarrierSet::barrier_set()->barrier_set_nmethod(); Drive-by comment: `bs_nm` seems not used any more. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23996#discussion_r1990347462 From shade at openjdk.org Wed Mar 12 07:35:33 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 07:35:33 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v2] In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: <370pnPWKXnqHXz9pVOoU9vFfqdH8zIIV2K7BpqWRcEI=.0c63f38f-84ab-49c3-a0da-1ad9f1b22fb1@github.com> > A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. 
> > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Add to LogCompilation as well - Merge branch 'master' into JDK-8351640-nmethod-not-entrant-reason - Use resource allocation for temp buffer - Base version ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23980/files - new: https://git.openjdk.org/jdk/pull/23980/files/b13a1080..38491fb2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=00-01 Stats: 38661 lines in 408 files changed: 18309 ins; 13442 del; 6910 mod Patch: https://git.openjdk.org/jdk/pull/23980.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23980/head:pull/23980 PR: https://git.openjdk.org/jdk/pull/23980 From shade at openjdk.org Wed Mar 12 07:35:33 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 07:35:33 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v2] In-Reply-To: References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Tue, 11 Mar 2025 18:45:40 GMT, Vladimir Ivanov wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Add to LogCompilation as well >> - Merge branch 'master' into JDK-8351640-nmethod-not-entrant-reason >> - Use resource allocation for temp buffer >> - Base version > > src/hotspot/share/code/nmethod.cpp line 1965: > >> 1963: if (LogCompilation) { >> 1964: if (xtty != nullptr) { >> 1965: ttyLocker ttyl; // keep the following output all in one block > > Please, include same info in `LogCompilation` log. Sure, added. 
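As a stand-alone illustration of the lifecycle described in the quoted log excerpt above: the following little program (purely hypothetical, not part of this PR; class and method names are made up) will typically go through the same transitions when run with -XX:+PrintCompilation. Whether the uncommon trap actually fires depends on the JIT's profiling heuristics, so treat it as a sketch rather than a guaranteed reproducer.

    public class NotEntrantDemo {
        // Hot method with a branch that stays cold during warm-up. C2 may prune
        // the cold path based on profile data and later deoptimize ("uncommon
        // trap") when that path is finally taken, after which the method is
        // recompiled and the old compilations are made not entrant.
        static int lookup(int[] table, int key) {
            for (int i = 0; i < table.length; i++) {
                if (table[i] == key) {
                    return i;
                }
            }
            return -1; // never reached during warm-up
        }

        public static void main(String[] args) {
            int[] table = new int[1024];
            for (int i = 0; i < table.length; i++) {
                table[i] = i;
            }
            long sum = 0;
            // Warm up: every key is found, so the "return -1" path stays cold.
            for (int i = 0; i < 200_000; i++) {
                sum += lookup(table, i & 1023);
            }
            // Now query keys that are never found, forcing the cold path.
            for (int i = 0; i < 200_000; i++) {
                sum += lookup(table, -42);
            }
            System.out.println(sum);
        }
    }

Running it as java -XX:+PrintCompilation NotEntrantDemo | grep lookup should show the level 3/4 compilations and, with this change, reason strings such as "made not entrant: not used" or "made not entrant: uncommon trap".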
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23980#discussion_r1990826189 From dnsimon at openjdk.org Wed Mar 12 09:16:44 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 12 Mar 2025 09:16:44 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: References: Message-ID: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. Doug Simon has updated the pull request incrementally with one additional commit since the last revision: removed unused code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23996/files - new: https://git.openjdk.org/jdk/pull/23996/files/b3d4721d..95da3c2f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23996.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23996/head:pull/23996 PR: https://git.openjdk.org/jdk/pull/23996 From shade at openjdk.org Wed Mar 12 09:46:01 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 09:46:01 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: References: Message-ID: <1stcVqx5LbF9cnNm4gb4YXqoHBbBBigH5fpYlBqRttI=.79261377-2b11-49eb-802d-b579fd23a9ff@github.com> On Wed, 12 Mar 2025 09:16:44 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > removed unused code Looks fine, thanks. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23996#pullrequestreview-2677747978 From tschatzl at openjdk.org Wed Mar 12 11:58:45 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 12 Mar 2025 11:58:45 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision: - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. Cause are last-minute changes before making the PR ready to review. Testing: without the patch, occurs fairly frequently when continuously (1 in 20) starting refinement. Does not afterward. - * ayang review 3 * comments * minor refactorings - * iwalulya review * renaming * fix some includes, forward declaration - * fix whitespace * additional whitespace between log tags * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename - ayang review * renamings * refactorings - iwalulya review * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement * predicate for determining whether the refinement has been disabled * some other typos/comment improvements * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming - * ayang review - fix comment - * iwalulya review 2 * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState * some additional documentation - ... 
and 14 more: https://git.openjdk.org/jdk/compare/f77fa17b...aec95051 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/758fac01..aec95051 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=15-16 Stats: 78123 lines in 1539 files changed: 36243 ins; 29177 del; 12703 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From dnsimon at openjdk.org Wed Mar 12 12:21:57 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 12 Mar 2025 12:21:57 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 09:16:44 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > removed unused code `gc/TestAllocHumongousFragment.java#generational` is failing on Windows: https://github.com/dougxc/jdk/actions/runs/13807682996/job/38625487569#step:9:630 I don't think it can be caused by this PR. Are you able to confirm that @shipilev ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23996#issuecomment-2717699848 From shade at openjdk.org Wed Mar 12 12:34:03 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 12:34:03 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: <1stcVqx5LbF9cnNm4gb4YXqoHBbBBigH5fpYlBqRttI=.79261377-2b11-49eb-802d-b579fd23a9ff@github.com> References: <1stcVqx5LbF9cnNm4gb4YXqoHBbBBigH5fpYlBqRttI=.79261377-2b11-49eb-802d-b579fd23a9ff@github.com> Message-ID: On Wed, 12 Mar 2025 09:43:21 GMT, Aleksey Shipilev wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> removed unused code > > Looks fine, thanks. > `gc/TestAllocHumongousFragment.java#generational` is failing on Windows: https://github.com/dougxc/jdk/actions/runs/13807682996/job/38625487569#step:9:630 I don't think it can be caused by this PR. Are you able to confirm that @shipilev ? It was problemlisted by #23982 yesterday. You can ignore it, or merge with recent master to get clean GHA runs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23996#issuecomment-2717727270 From dnsimon at openjdk.org Wed Mar 12 12:34:04 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 12 Mar 2025 12:34:04 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: References: Message-ID: <-urz_l6_Sa21e9SspzfanN4VGdOFZJxOv6E79Npfv5A=.baeb6814-351b-4711-b7fe-4d87e0700532@github.com> On Wed, 12 Mar 2025 09:16:44 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > removed unused code I'll ignore it. Thanks for pointing out the problem listing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23996#issuecomment-2717730379 From dnsimon at openjdk.org Wed Mar 12 12:34:05 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 12 Mar 2025 12:34:05 GMT Subject: Integrated: 8351700: Remove code conditional on BarrierSetNMethod being null In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 19:29:05 GMT, Doug Simon wrote: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. This pull request has now been integrated. Changeset: 95b66d5a Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/95b66d5a43a77b257a097afe5df369f92769abd2 Stats: 171 lines in 27 files changed: 5 ins; 102 del; 64 mod 8351700: Remove code conditional on BarrierSetNMethod being null Reviewed-by: shade, eosterlund, never ------------- PR: https://git.openjdk.org/jdk/pull/23996 From ayang at openjdk.org Wed Mar 12 13:33:59 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Wed, 12 Mar 2025 13:33:59 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v14] In-Reply-To: <5w6qUwzDQadxseocRl6rRF0AllyeukWTpYl2XjAfiTE=.fb62a50e-e308-4d08-8057-67e70e13ccbb@github.com> References: <5w6qUwzDQadxseocRl6rRF0AllyeukWTpYl2XjAfiTE=.fb62a50e-e308-4d08-8057-67e70e13ccbb@github.com> Message-ID: On Fri, 7 Mar 2025 13:14:02 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * iwalulya review >> * renaming >> * fix some includes, forward declaration > > src/hotspot/share/gc/g1/g1CardTable.hpp line 76: > >> 74: g1_card_already_scanned = 0x1, >> 75: g1_to_cset_card = 0x2, >> 76: g1_from_remset_card = 0x4 > > Could you outline the motivation for this more precise info? Is it for optimization or essentially for correctness? OK, it's for better performance, not correctness. How much is the improvement? As I understand it, this more precise info is largely independent of the new barrier logic. I wonder if it makes sense to extract this out to its own ticket to better assess its impact on perf and impl complexity. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991375754 From ayang at openjdk.org Wed Mar 12 13:34:04 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Wed, 12 Mar 2025 13:34:04 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: References: Message-ID: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> On Wed, 12 Mar 2025 11:58:45 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
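To get a rough feel for the barrier cost the JEP description above refers to, a reference-store loop along the following lines (purely illustrative; it is not one of the benchmarks linked from the JEP and is far less rigorous than a JMH run) can be timed once with -XX:+UseG1GC and once with -XX:+UseParallelGC. Much of the difference typically comes from the different post-write barriers emitted for the array stores, although allocation and collection behaviour also differ between the collectors, so the numbers are only indicative.

    public class StoreBarrierDemo {
        static Object[] targets = new Object[1 << 20];
        static Object payload = new Object();

        // Each iteration performs a reference store, which is exactly the kind
        // of write the G1 post-write barrier instruments.
        static void fill(Object[] a, Object v) {
            for (int i = 0; i < a.length; i++) {
                a[i] = v;
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < 100; i++) {
                fill(targets, payload);       // warm-up so the loop is C2-compiled
            }
            long start = System.nanoTime();
            for (int i = 0; i < 1_000; i++) {
                fill(targets, payload);
            }
            System.out.printf("%.1f ms%n", (System.nanoTime() - start) / 1e6);
        }
    }

For example: java -XX:+UseG1GC StoreBarrierDemo versus java -XX:+UseParallelGC StoreBarrierDemo.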
>> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision: > > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang > - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. > Cause are last-minute changes before making the PR ready to review. > > Testing: without the patch, occurs fairly frequently when continuously > (1 in 20) starting refinement. Does not afterward. 
> - * ayang review 3 > * comments > * minor refactorings > - * iwalulya review > * renaming > * fix some includes, forward declaration > - * fix whitespace > * additional whitespace between log tags > * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename > - ayang review > * renamings > * refactorings > - iwalulya review > * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement > * predicate for determining whether the refinement has been disabled > * some other typos/comment improvements > * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming > - * ayang review - fix comment > - * iwalulya review 2 > * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState > * some additional documentation > - ... and 14 more: https://git.openjdk.org/jdk/compare/53a66058...aec95051 src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 217: > 215: > 216: { > 217: SuspendibleThreadSetLeaver sts_leave; Can you add some comment on why leaving the set is required? It's not obvious to me why. I'd expect handshake to work out of the box... src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 263: > 261: > 262: SuspendibleThreadSetLeaver sts_leave; > 263: VMThread::execute(&op); Can you elaborate what synchronization this VM op is trying to achieve? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991489399 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991382024 From duke at openjdk.org Wed Mar 12 13:42:33 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 13:42:33 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Added validity test for the intrinsics. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/64135f29..f65ef7c4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=04-05 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Wed Mar 12 13:51:58 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 13:51:58 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Mon, 10 Mar 2025 03:00:09 GMT, Leonid Mesnik wrote: > There are no any new tests in the PR. How fix has been tested by openjdk tests? I have just added one. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2717950685 From duke at openjdk.org Wed Mar 12 13:52:02 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 13:52:02 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: On Thu, 6 Mar 2025 14:30:35 GMT, Jatin Bhateja wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Added alignment to loop entries. > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 2: > >> 1: /* >> 2: * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved. > > Please update copyright year Thanks, fixed. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 96: > >> 94: StubRoutines::_dilithiumMontMulByConstant = generate_dilithiumMontMulByConstant_avx512(); >> 95: StubRoutines::_dilithiumDecomposePoly = generate_dilithiumDecomposePoly_avx512(); >> 96: } > > Indentation fix needed Thanks, fixed. > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 362: > >> 360: const Register roundsLeft = r11; >> 361: >> 362: __ align(OptoLoopAlignment); > > Redundant alignment before label should be before it's bind Thanks, fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991546308 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991546488 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991546606 From duke at openjdk.org Wed Mar 12 13:52:06 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 13:52:06 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: <74tlAsyoYwN-fvtFyxp3xJYo76U68oF0ES4UVy7S_iY=.01f96647-395e-49bb-9e5a-f047b63460e0@github.com> On Thu, 6 Mar 2025 09:32:19 GMT, Jatin Bhateja wrote: >> I think the easiest is to put a for (int i = 0; i < 1000; i++) loop around the switch statement in the run() method of the ML_DSA_Test class (test/jdk/sun/security/provider/acvp/ML_DSA_Test.java). (This is because the intrinsics kick in after a few thousand calls of the method.) > > Hi @ferakocz , Yes, we should modify the test or lower the compilation threshold with -Xbatch -XX:TieredCompileThreshold=0.1. > > Alternatively, since the tests has a depedency on Automatic Cryptographic Validation Test server I have created a simplified test which cover all the security levels. > > Kindly include [test/hotspot/jtreg/compiler/intrinsics/signature/TestModuleLatticeDSA.java > ](https://github.com/ferakocz/jdk/pull/1) I have added a new command to the test test/jdk/sun/security/provider/acvp/Launcher.java. The line with the -Xcomp will invoke the intrinsics on the first call, so they will be tested. 
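For anyone who wants to exercise these code paths outside the ACVP setup, a plain JCA warm-up loop is usually enough. The sketch below is only an illustration (it is not part of this PR) and assumes the JDK 24+ standard algorithm names, i.e. "ML-DSA-65" for key pair generation and "ML-DSA" for signatures. Under default tiered compilation the hot ML-DSA helpers need a few thousand invocations before C2, and therefore the intrinsics, kick in, which is what the -Xcomp / lower-compile-threshold discussion above is about.

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.Signature;

    public class MLDSAWarmup {
        public static void main(String[] args) throws Exception {
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("ML-DSA-65");
            KeyPair kp = kpg.generateKeyPair();
            byte[] msg = "intrinsics warm-up".getBytes(StandardCharsets.UTF_8);

            Signature signer = Signature.getInstance("ML-DSA");
            Signature verifier = Signature.getInstance("ML-DSA");

            // Sign and verify repeatedly so the hot helpers behind ML-DSA become
            // warm enough to be JIT-compiled (and intrinsified where the CPU
            // supports it).
            for (int i = 0; i < 10_000; i++) {
                signer.initSign(kp.getPrivate());
                signer.update(msg);
                byte[] sig = signer.sign();

                verifier.initVerify(kp.getPublic());
                verifier.update(msg);
                if (!verifier.verify(sig)) {
                    throw new AssertionError("verification failed at iteration " + i);
                }
            }
            System.out.println("done");
        }
    }

Running the same program with -Xcomp (as in the new @run line) removes the warm-up requirement, since every method is compiled on first use.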
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991546056 From tschatzl at openjdk.org Wed Mar 12 14:00:15 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 12 Mar 2025 14:00:15 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> References: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> Message-ID: On Wed, 12 Mar 2025 12:23:50 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision: >> >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang >> - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. >> Cause are last-minute changes before making the PR ready to review. >> >> Testing: without the patch, occurs fairly frequently when continuously >> (1 in 20) starting refinement. Does not afterward. >> - * ayang review 3 >> * comments >> * minor refactorings >> - * iwalulya review >> * renaming >> * fix some includes, forward declaration >> - * fix whitespace >> * additional whitespace between log tags >> * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename >> - ayang review >> * renamings >> * refactorings >> - iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming >> - * ayang review - fix comment >> - * iwalulya review 2 >> * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState >> * some additional documentation >> - ... and 14 more: https://git.openjdk.org/jdk/compare/5727f166...aec95051 > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 263: > >> 261: >> 262: SuspendibleThreadSetLeaver sts_leave; >> 263: VMThread::execute(&op); > > Can you elaborate what synchronization this VM op is trying to achieve? Memory visibility for refinement threads for the references written to the heap. Without them, they may not have received the most recent values. This is the same as the `StoreLoad` barriers synchronization between mutator and refinement threads imo. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991561707 From lmesnik at openjdk.org Wed Mar 12 15:37:12 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Wed, 12 Mar 2025 15:37:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 13:42:33 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added validity test for the intrinsics. test/jdk/sun/security/provider/acvp/Launcher.java line 43: > 41: * @modules java.base/sun.security.provider > 42: * @run main Launcher > 43: * @run main/othervm -Xcomp Launcher Thank you for adding this case. Please add it as a separate testcase: /* * @test * @summary Test verifies intrinsic implementation. * @library /test/lib * @modules java.base/sun.security.provider * @run main/othervm -Xcomp Launcher */ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991769739 From vlivanov at openjdk.org Wed Mar 12 17:25:06 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 12 Mar 2025 17:25:06 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v2] In-Reply-To: <370pnPWKXnqHXz9pVOoU9vFfqdH8zIIV2K7BpqWRcEI=.0c63f38f-84ab-49c3-a0da-1ad9f1b22fb1@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> <370pnPWKXnqHXz9pVOoU9vFfqdH8zIIV2K7BpqWRcEI=.0c63f38f-84ab-49c3-a0da-1ad9f1b22fb1@github.com> Message-ID: On Wed, 12 Mar 2025 07:35:33 GMT, Aleksey Shipilev wrote: >> A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. >> >> Sample log excerpt for mainline: >> >> >> $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log >> 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap >> 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> >> >> You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `hotspot:tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add to LogCompilation as well > - Merge branch 'master' into JDK-8351640-nmethod-not-entrant-reason > - Use resource allocation for temp buffer > - Base version Looks good. Do you mind incorporating log compilation tool support? 
[1] diff --git a/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/LogParser.java b/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/LogParser.java index e1e305abe10..61cbc054200 100644 --- a/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/LogParser.java +++ b/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/LogParser.java @@ -1099,6 +1099,10 @@ public void startElement(String uri, String localName, String qname, Attributes e.setCompileKind(compileKind); String level = atts.getValue("level"); e.setLevel(level); + String reason = atts.getValue("reason"); + if (reason != null) { + e.setReason(reason); + } events.add(e); } else if (qname.equals("uncommon_trap")) { String id = atts.getValue("compile_id"); diff --git a/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/MakeNotEntrantEvent.java b/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/MakeNotEntrantEvent.java index b4015537c74..d230f1b4336 100644 --- a/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/MakeNotEntrantEvent.java +++ b/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/MakeNotEntrantEvent.java @@ -47,6 +47,11 @@ class MakeNotEntrantEvent extends BasicLogEvent { */ private String level; + /** + * The reason of invalidation. + */ + private String reason; + /** * The compile kind. */ @@ -64,10 +69,14 @@ public NMethod getNMethod() { public void print(PrintStream stream, boolean printID) { if (isZombie()) { - stream.printf("%s make_zombie\n", getId()); + stream.printf("%s make_zombie", getId()); } else { - stream.printf("%s make_not_entrant\n", getId()); + stream.printf("%s make_not_entrant", getId()); + } + if (getReason() != null) { + stream.printf(": %s", getReason()); } + stream.println(); } public boolean isZombie() { @@ -88,7 +97,21 @@ public void setLevel(String level) { this.level = level; } - /** + /** + * @return the reason + */ + public String getReason() { + return reason; + } + + /** + * @param reason the reason to set + */ + public void setReason(String reason) { + this.reason = reason; + } + + /** * @return the compileKind */ public String getCompileKind() { ------------- PR Review: https://git.openjdk.org/jdk/pull/23980#pullrequestreview-2679301582 From shade at openjdk.org Wed Mar 12 17:39:35 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 17:39:35 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v3] In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: > A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. 
> > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Add LogCompilation support ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23980/files - new: https://git.openjdk.org/jdk/pull/23980/files/38491fb2..5da9766d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=01-02 Stats: 30 lines in 2 files changed: 27 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23980.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23980/head:pull/23980 PR: https://git.openjdk.org/jdk/pull/23980 From shade at openjdk.org Wed Mar 12 17:39:35 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 17:39:35 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v2] In-Reply-To: References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> <370pnPWKXnqHXz9pVOoU9vFfqdH8zIIV2K7BpqWRcEI=.0c63f38f-84ab-49c3-a0da-1ad9f1b22fb1@github.com> Message-ID: On Wed, 12 Mar 2025 17:22:06 GMT, Vladimir Ivanov wrote: > Do you mind incorporating log compilation tool support? [1] I don't mind, added. Looks like this still works: $ cd src/tools/LogCompilation $ make ------------- PR Comment: https://git.openjdk.org/jdk/pull/23980#issuecomment-2718629919 From tschatzl at openjdk.org Wed Mar 12 17:44:01 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 12 Mar 2025 17:44:01 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> References: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> Message-ID: On Wed, 12 Mar 2025 13:20:25 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 24 additional commits since the last revision: >> >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang >> - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. >> Cause are last-minute changes before making the PR ready to review. >> >> Testing: without the patch, occurs fairly frequently when continuously >> (1 in 20) starting refinement. Does not afterward. >> - * ayang review 3 >> * comments >> * minor refactorings >> - * iwalulya review >> * renaming >> * fix some includes, forward declaration >> - * fix whitespace >> * additional whitespace between log tags >> * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename >> - ayang review >> * renamings >> * refactorings >> - iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming >> - * ayang review - fix comment >> - * iwalulya review 2 >> * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState >> * some additional documentation >> - ... and 14 more: https://git.openjdk.org/jdk/compare/0c7b5abb...aec95051 > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 217: > >> 215: >> 216: { >> 217: SuspendibleThreadSetLeaver sts_leave; > > Can you add some comment on why leaving the set is required? It's not obvious to me why. I'd expect handshake to work out of the box... It isn't apparently. Removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991999476 From tschatzl at openjdk.org Wed Mar 12 17:59:51 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 12 Mar 2025 17:59:51 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v18] In-Reply-To: References: Message-ID: <3KOwgdzYn_vXQVWisVUEY-0i1gtZEfZhcD1-id3epYE=.17aa84bc-a7ec-4dda-b596-7a1016d710fc@github.com> > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * ayang review * remove unnecessary STSleaver * some more documentation around to_collection_card card color ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/aec95051..3766b76c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=16-17 Stats: 18 lines in 2 files changed: 5 ins; 4 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From vlivanov at openjdk.org Wed Mar 12 18:17:03 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 12 Mar 2025 18:17:03 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v3] In-Reply-To: References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Wed, 12 Mar 2025 17:39:35 GMT, Aleksey Shipilev wrote: >> A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. 
>> >> Sample log excerpt for mainline: >> >> >> $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log >> 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap >> 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> >> >> You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `hotspot:tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Add LogCompilation support Thanks. Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23980#pullrequestreview-2679447944 From shade at openjdk.org Wed Mar 12 18:17:04 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 18:17:04 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v3] In-Reply-To: References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Wed, 12 Mar 2025 17:39:35 GMT, Aleksey Shipilev wrote: >> A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. >> >> Sample log excerpt for mainline: >> >> >> $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log >> 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap >> 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> >> >> You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. 
>> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `hotspot:tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Add LogCompilation support Thanks! I'll integrate once GHA clears. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23980#issuecomment-2718719599 From duke at openjdk.org Wed Mar 12 19:19:08 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 19:19:08 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Made the intrinsics test separate from the pure java test. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/f65ef7c4..aa2fdf2d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=05-06 Stats: 8 lines in 1 file changed: 8 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From shade at openjdk.org Wed Mar 12 19:47:58 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 19:47:58 GMT Subject: Integrated: 8351640: Print reason for making method not entrant In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Tue, 11 Mar 2025 11:36:59 GMT, Aleksey Shipilev wrote: > A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. > > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` This pull request has now been integrated. 
Changeset: 930455b5 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/930455b59608b547017c9649efeb6bd381340c34 Stats: 68 lines in 16 files changed: 35 ins; 0 del; 33 mod 8351640: Print reason for making method not entrant Co-authored-by: Vladimir Ivanov Reviewed-by: vlivanov, kvn ------------- PR: https://git.openjdk.org/jdk/pull/23980 From tschatzl at openjdk.org Thu Mar 13 13:07:29 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 13 Mar 2025 13:07:29 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v19] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. 
* additional verification * added some missing ResourceMarks in asserts * added variant of ArrayJuggle2 that crashes fairly quickly without these changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/3766b76c..78611173 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=18 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=17-18 Stats: 111 lines in 11 files changed: 82 ins; 13 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From galder at openjdk.org Thu Mar 13 13:50:14 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 13 Mar 2025 13:50:14 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> Message-ID: On Fri, 7 Mar 2025 13:17:29 GMT, Emanuel Peter wrote: >>> As for possible solutions. In all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFE's: >>> >>> * Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue. >> >> I've created [JDK-8351409](https://bugs.openjdk.org/browse/JDK-8351409) to address this. > > @galderz Excellent. Testing looks all good on our side. Yes I think what you saw was unrelated. > @rwestrel Could give this a last quick scan and then I think you can integrate :) Thanks @eme64 @rwestrel @chhagedorn for your patience with this! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2721319344 From duke at openjdk.org Thu Mar 13 13:50:22 2025 From: duke at openjdk.org (duke) Date: Thu, 13 Mar 2025 13:50:22 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:19:03 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. 
>> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 47 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Add assertion comments > - Add simple reduction benchmarks on top of multiply ones > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - ... and 37 more: https://git.openjdk.org/jdk/compare/c836c5b7...1aa690d3 @galderz Your change (at version 1aa690d391ef3536d422ba93c33d0fc273a911c6) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2721323015 From galder at openjdk.org Thu Mar 13 13:57:23 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 13 Mar 2025 13:57:23 GMT Subject: Integrated: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) In-Reply-To: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Tue, 9 Jul 2024 12:07:37 GMT, Galder Zamarre?o wrote: > This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. > > Currently vectorization does not kick in for loops containing either of these calls because of the following error: > > > VLoop::check_preconditions: failed: control flow in loop not allowed > > > The control flow is due to the java implementation for these methods, e.g. 
> > > public static long max(long a, long b) { > return (a >= b) ? a : b; > } > > > This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. > By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. > E.g. > > > SuperWord::transform_loop: > Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined > 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) > > > Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1155 > long max 1173 > > > After the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1042 > long max 1042 > > > This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. > Therefore, it still relies on the macro expansion to transform those into CMoveL. > > I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier1 2500 2500 0 0 >>> jtreg:test/jdk:tier1 ... This pull request has now been integrated. Changeset: 4e51a8c9 Author: Galder Zamarre?o URL: https://git.openjdk.org/jdk/commit/4e51a8c9ad4e5345d05cf32ce1e82b7158f80e93 Stats: 844 lines in 9 files changed: 725 ins; 107 del; 12 mod 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) Reviewed-by: roland, epeter, chagedorn, darcy ------------- PR: https://git.openjdk.org/jdk/pull/20098 From tschatzl at openjdk.org Thu Mar 13 14:16:07 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 13 Mar 2025 14:16:07 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v19] In-Reply-To: References: Message-ID: <-ys7CbBNU4hCmEgYQyZpmBQ_rso4i2_KoFHLPNv73sI=.bd715b1d-b9fd-48b7-bb06-d6673ab2dbfc@github.com> On Thu, 13 Mar 2025 13:07:29 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. 
Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. > * additional verification > * added some missing ResourceMarks in asserts > * added variant of ArrayJuggle2 that crashes fairly quickly without these changes Commit https://github.com/openjdk/jdk/pull/23739/commits/786111735c306583af5bc75f7653f0da67d52adb fixes an issue with full gc interrupting refinement while the global card table and the JavaThread's card table changes. Testing: tier1-7 with changes, tier1-5 with changes stressing refinement similar to the ones added to the new test. The new variant of `ArrayJuggle2` fails >50% of all times in our CI without the patch (verified 700 or so executions of that not failing with patch). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2721413659 From tschatzl at openjdk.org Fri Mar 14 14:28:57 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 14 Mar 2025 14:28:57 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: References: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> Message-ID: <58jXaIS3TNN9Y9xWGSKWM7B4C0dbZ6YxRWjPMmBeFnY=.506b75a0-12a4-424c-869c-8358195947d9@github.com> On Wed, 12 Mar 2025 13:56:57 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 263: >> >>> 261: >>> 262: SuspendibleThreadSetLeaver sts_leave; >>> 263: VMThread::execute(&op); >> >> Can you elaborate what synchronization this VM op is trying to achieve? 
> > Memory visibility for refinement threads for the references written to the heap. Without them, they may not have received the most recent values. > This is the same as the `StoreLoad` barriers synchronization between mutator and refinement threads imo. There has been a discussion about whether this is actually needed. Initially we thought that this could be removed because it's only the refinement worker threads that would need memory synchronization, and the memory synchronization is handled by just starting up the refinement threads. However the rebuild remsets process (marking threads) also access the global card table reference to mark the to-collection-set cards and its value must be synchronized. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1995683088 From tschatzl at openjdk.org Fri Mar 14 14:37:27 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 14 Mar 2025 14:37:27 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v20] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). 
> > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * ayang review * re-add STS leaver for java thread handshake ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/78611173..51a9eed8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=19 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=18-19 Stats: 15 lines in 1 file changed: 5 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Fri Mar 14 16:35:38 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 14 Mar 2025 16:35:38 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v21] In-Reply-To: References: Message-ID: <1bH6bLmIYx6eVtZ4IPlFtdYpdCAwSaNB6w0uNljTSJE=.8a4a88c7-2f66-493c-91dd-6fc6c744c08f@github.com> > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. 
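For contrast, the reduced barrier this change aims for ends the inline path at the card write on the mutator's (primary) card table: no StoreLoad fence and no enqueueing into a dirty card queue. Below is a minimal conceptual sketch in Java; the names, the card size, and the dirty marker value are illustrative assumptions, not the actual HotSpot code, which emits the barrier as a handful of compiled instructions and may retain some of the filtering checks quoted above.

```
// Conceptual model of the reduced post-write barrier for a reference store x.a = y.
// Everything here (class, field and constant names, card size) is assumed for illustration.
final class ReducedPostBarrierSketch {
    static final int CARD_SHIFT = 9;                      // assume 512 bytes of heap per card
    static final byte DIRTY = 0;                          // assume this marks a dirty card
    static byte[] primaryCardTable = new byte[1 << 20];   // the only table the mutator touches

    static void postWriteBarrier(long storeAddress) {
        int cardIndex = (int) (storeAddress >>> CARD_SHIFT);
        // A single card-table store: no StoreLoad synchronization, no dirty card queue.
        primaryCardTable[cardIndex] = DIRTY;
    }
}
```

Refinement then operates on the separate refinement table after the tables are switched atomically, which is what removes the need for the fine-grained mutator/refinement synchronization of the old barrier.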
> > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 28 commits: - Merge branch 'master' into 8342381-card-table-instead-of-dcq - * ayang review * re-add STS leaver for java thread handshake - * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. * additional verification * added some missing ResourceMarks in asserts * added variant of ArrayJuggle2 that crashes fairly quickly without these changes - * ayang review * remove unnecessary STSleaver * some more documentation around to_collection_card card color - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. Cause are last-minute changes before making the PR ready to review. Testing: without the patch, occurs fairly frequently when continuously (1 in 20) starting refinement. Does not afterward. - * ayang review 3 * comments * minor refactorings - * iwalulya review * renaming * fix some includes, forward declaration - * fix whitespace * additional whitespace between log tags * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename - ... and 18 more: https://git.openjdk.org/jdk/compare/7f428041...b0730176 ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=20 Stats: 6761 lines in 99 files changed: 2368 ins; 3464 del; 929 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Sat Mar 15 13:12:39 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Sat, 15 Mar 2025 13:12:39 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v22] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
> > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * more documentation on why we need to rendezvous the gc threads ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/b0730176..447fe39b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=20-21 Stats: 7 lines in 1 file changed: 6 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Mon Mar 17 10:32:33 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 17 Mar 2025 10:32:33 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v23] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
>
> The main reason for the current barrier is how g1 implements concurrent refinement:
> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads,
> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
>
> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
>
>
> // Filtering
> if (region(@x.a) == region(y)) goto done; // same region check
> if (y == null) goto done; // null value check
> if (card(@x.a) == young_card) goto done; // write to young gen check
> StoreLoad; // synchronize
> if (card(@x.a) == dirty_card) goto done;
>
> *card(@x.a) = dirty
>
> // Card tracking
> enqueue(card-address(@x.a)) into thread-local-dcq;
> if (thread-local-dcq is not full) goto done;
>
> call runtime to move thread-local-dcq into dcqs
>
> done:
>
>
> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc.
>
> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>
> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>
> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se...

Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision:

* obsolete G1UpdateBufferSize

G1UpdateBufferSize has previously been used to size the refinement buffers and to impose a minimum limit on the number of cards per thread that need to be pending before refinement starts. The former function is now obsolete with the removal of the dirty card queues, and the latter functionality has been taken over by the new diagnostic option `G1PerThreadPendingCardThreshold`. I prefer to make this a diagnostic option rather than a product option because it is only needed by some test cases to produce otherwise unwanted behavior (continuous refinement). CSR is pending.
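Since `G1PerThreadPendingCardThreshold` is a diagnostic option, a test or command line that wants to exercise it would presumably need the usual diagnostic unlocking, along these lines (both angle-bracketed values are placeholders, not recommendations):

```
java -XX:+UnlockDiagnosticVMOptions -XX:G1PerThreadPendingCardThreshold=<cards> <application>
```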
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/447fe39b..4d0afd57 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=22 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=21-22 Stats: 16 lines in 7 files changed: 2 ins; 9 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Mon Mar 17 11:38:04 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 17 Mar 2025 11:38:04 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 15:34:18 GMT, Leonid Mesnik wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Added validity test for the intrinsics. > > test/jdk/sun/security/provider/acvp/Launcher.java line 43: > >> 41: * @modules java.base/sun.security.provider >> 42: * @run main Launcher >> 43: * @run main/othervm -Xcomp Launcher > > Thank you for adding this case. Please add it as a separate testcase: > /* > * @test > * @summary Test verifies intrinsic implementation. > * @library /test/lib > * @modules java.base/sun.security.provider > * @run main/othervm -Xcomp Launcher > */ Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1998545085 From lmesnik at openjdk.org Mon Mar 17 16:10:25 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Mon, 17 Mar 2025 16:10:25 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 19:19:08 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Made the intrinsics test separate from the pure java test. Test changes looks good. ------------- Marked as reviewed by lmesnik (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23860#pullrequestreview-2691165965 From vpaprotski at openjdk.org Mon Mar 17 21:49:12 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 17 Mar 2025 21:49:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 19:19:08 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Made the intrinsics test separate from the pure java test. Partial review, just didnt want to sit on comments for this long. (Spent quite a bit of time catching up on papers and math required) The biggest roadblock I have following the code are raw register numbers. (And more comments? perhaps I need more math knowledge, but comments would help too). Also, 'hidden variables' (xmm30). Can't complain, because this is exactly what Vladimir Ivanov told me to do on my first PR https://github.com/openjdk/jdk/pull/10582#discussion_r1022185591 Perhaps that discussion applies here too. 
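As background for the dilithium stub comments that follow: `montmulEven` vectorizes a Montgomery multiplication modulo the ML-DSA prime q = 8380417 with R = 2^32. The scalar sketch below is pieced together from the Java lines quoted in the annotations; the constant names and the value used for q^-1 mod 2^32 follow the public Dilithium reference parameters and are assumptions about the JDK sources rather than a copy of them.

```
// Scalar sketch of Montgomery multiplication mod q (q = 8380417, R = 2^32).
// Returns a value congruent to b * c * R^-1 (mod q); for inputs already reduced
// mod q the result lies strictly between -q and q.
final class MontMulSketch {
    static final int MONT_Q = 8380417;             // ML-DSA / Dilithium modulus q
    static final int MONT_Q_INV_MOD_R = 58728449;  // q^-1 mod 2^32 (reference-implementation value)
    static final int MONT_R_BITS = 32;

    static int montMul(int b, int c) {
        long a = (long) b * (long) c;              // full 64-bit product
        int aHigh = (int) (a >> MONT_R_BITS);      // the "odd" 32-bit half
        int aLow = (int) a;                        // the "even" 32-bit half
        int m = MONT_Q_INV_MOD_R * aLow;           // signed low product, wraps mod 2^32
        // a - (long) m * q is divisible by 2^32, so the reduction collapses to a
        // subtraction of the two high halves.
        return aHigh - (int) (((long) m * MONT_Q) >> MONT_R_BITS);
    }
}
```

This is also why, in the vectorized form, vpmuldq consumes the even 32-bit lanes and leaves results in the odd columns, which the vpshufd and evpermt2d steps discussed in the annotations then shuffle back into place.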
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 45: > 43: // Constants > 44: // > 45: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Consts[] = { This is really nitpicking.. but could had loaded constants inline with `movl` without requiring an ExternalAddress()? Nice to have constants together, only complaint is we have 'magic offsets' in ASM to reach in for particular one.. This one isnt too bad, offset of 32bits is easy to inspect visually (`dilithiumAvx512ConstsAddr()` could take a parameter perhaps) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 58: > 56: > 57: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Perms[] = { > 58: // collect montmul results into the destination register same as `dilithiumAvx512Consts()`, 'magic offsets'; except here they are harder to count (eg. not clear visually what is the offset of `ntt inverse`). Could be split into three constant arrays to make the compiler count for us src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 127: > 125: for (int i = 0; i < parCnt; i++) { > 126: __ evpsubd(xmm(i + outputReg), k0, xmm(i + scratchReg1), xmm(i + scratchReg2), false, Assembler::AVX_512bit); > 127: } This is such a deceptively brilliant function!!! Took me a while to understand (and map to Java `montMul` function). Perhaps needs more comments. The comment on line 99 does provide good hints, but I still had some trouble. I ended up annotating a copy quite a bit. I do think all 'clever code' needs comments. Here is my annotated version, if you want to copy out anything: static void montmulEven2(XMMRegister outputReg, XMMRegister inputReg1, XMMRegister inputReg2, XMMRegister scratchReg1, XMMRegister scratchReg2, XMMRegister montQInvModR, XMMRegister dilithium_q, int parCnt, MacroAssembler* _masm) { int output = outputReg->encoding(); int input1 = inputReg1->encoding(); int input2 = inputReg2->encoding(); int scratch1 = scratchReg1->encoding(); int scratch2 = scratchReg2->encoding(); for (int i = 0; i < parCnt; i++) { // scratch1 = (int64)input1_even*input2_even // Java: long a = (long) b * (long) c; __ vpmuldq(xmm(i + scratch1), xmm(i + input1), xmm((input2 == 29) ? 29 : input2 + i), Assembler::AVX_512bit); } for (int i = 0; i < parCnt; i++) { // scratch2 = int32(montQInvModR*(int32)scratch1) // Java: int aLow = (int) a; // Java: int m = MONT_Q_INV_MOD_R * aLow; // signed low product __ vpmulld(xmm(i + scratch2), xmm(i + scratch1), montQInvModR, Assembler::AVX_512bit); } for (int i = 0; i < parCnt; i++) { // scratch2 = (int64)scratch2_even*dilithium_q_even // Java: ((long)m * MONT_Q) __ vpmuldq(xmm(i + scratch2), xmm(i + scratch2), dilithium_q, Assembler::AVX_512bit); } for (int i = 0; i < parCnt; i++) { // output_odd = scratch1_odd - scratch2_odd // Java: (aHigh - (int) (("scratch2") >> MONT_R_BITS)) __ evpsubd(xmm(i + output), k0, xmm(i + scratch1), xmm(i + scratch2), false, Assembler::AVX_512bit); } } - add comment that input2 can be xmm29, treated as constants, not consecutive (i.e. zetas) - Candidate for ascii art, even/odd columns, implicit int/long casts (or more 'math' comments on what happens) - use XMMRegisters instead of numbers (improve callsite readability) - can use either `inputReg1 = inputReg1->successor()` - or get `encoding()` and keep current style - could be static (local) function (hide from header), then pass _masm - pass all registers used (helps seeing register allocation, confirm no overlaps) False trails (i.e. 
nothing to do, but I thought about it already, so other reviewer doesnt have to?) - (ignore: worse performance) squash into a single for loop, let cpu do out-of-order (and improve readability) - xmm30/xmm31 (montQInvModR/dilithium_q) are constant. At a glance, it looks like they should be combined into one precomputed one. And paper 039.pdf suggests merging constants precompute the product; but.. different constants and looking at Java, there are several implicit casts For reductions of products inside the NTT this is not a problem because one has to multiply by the roots of unity which are compile-time constants. So one can just precompute them with an additional factor of ? mod q so that the results after Montgomery reduction are in fact congruent to the desired value a src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 140: > 138: __ vpmuldq(xmm(scratchReg1 + 1), xmm(inputReg12), xmm(inputReg2 + 1), Assembler::AVX_512bit); > 139: __ vpmuldq(xmm(scratchReg1 + 2), xmm(inputReg13), xmm(inputReg2 + 2), Assembler::AVX_512bit); > 140: __ vpmuldq(xmm(scratchReg1 + 3), xmm(inputReg14), xmm(inputReg2 + 3), Assembler::AVX_512bit); Another option for these four lines, to keep the style of rest of function int inputReg1[] = {inputReg11, inputReg12, inputReg13, inputReg14}; for (int i = 0; i < parCnt; i++) { __ vpmuldq(xmm(scratchReg1 + i), inputReg1[i], xmm(inputReg2 + i), Assembler::AVX_512bit); } src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 197: > 195: > 196: // level 0 > 197: montmulEven(20, 8, 29, 20, 16, 4); It would improve readability to know which parameter is a register, and which is a count.. i.e. `montmulEven(xmm20, xmm8, xmm29, xmm20, xmm16, 4);` (its not _that_ bad, once I remember that its always the last parameter.. but it does add to the 'mental load' one has to carry, and this code is already interesting enough) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 980: > 978: // Dilithium multiply polynomials in the NTT domain. > 979: // Implements > 980: // static int implDilithiumNttMult( I suppose no java changes in this PR, but I notice that the inputs are all assumed to have fixed size. Most/all intrinsics I worked with had some sort of guard (eg `Objects.checkFromIndexSize`) right before the intrinsic java call. (It usually looks like it can be optimized away). But I notice no such guard here on the java side. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1010: > 1008: __ vpbroadcastd(xmm31, Address(dilithiumConsts, 4), Assembler::AVX_512bit); // q > 1009: __ vpbroadcastd(xmm29, Address(dilithiumConsts, 12), Assembler::AVX_512bit); // 2^64 mod q > 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); - use of `c_rarg3` is 'clever' so probably should have a comment (ie. 'no 3rd parameter, free register') - Alternatively, load directly into the vector with `ExternalAddress()`; you need a scratch register (use r10) but address is close enough, it actually wont be used. 
Here is the disassembly I got: StubRoutines::dilithiumNttMult [0x00007f414fb68280, 0x00007f414fb68548] (712 bytes) -------------------------------------------------------------------------------- add %al,(%rax) 0x00007f414fb68280: push %rbp 0x00007f414fb68281: mov %rsp,%rbp 0x00007f414fb68284: vpbroadcastd 0x18f9fe32(%rip),%zmm30 # 0x00007f4168b080c0 0x00007f414fb6828e: vpbroadcastd 0x18f9fe2c(%rip),%zmm31 # 0x00007f4168b080c4 0x00007f414fb68298: vpbroadcastd 0x18f9fe2a(%rip),%zmm29 # 0x00007f4168b080cc 0x00007f414fb682a2: vmovdqu32 0x18f9f8d4(%rip),%zmm28 # 0x00007f4168b07b80 ``` The `ExternalAddress()` calls for above assembler ``` const Register scratch = r10; const XMMRegister montRSquareModQ = xmm29; const XMMRegister montQInvModR = xmm30; const XMMRegister dilithium_q = xmm31; const XMMRegister perms = xmm28; __ vpbroadcastd(montQInvModR, ExternalAddress(dilithiumAvx512ConstsAddr()), Assembler::AVX_512bit, scratch); // q^-1 mod 2^32 __ vpbroadcastd(dilithium_q, ExternalAddress(dilithiumAvx512ConstsAddr() + 4), Assembler::AVX_512bit, scratch); // q __ vpbroadcastd(montRSquareModQ, ExternalAddress(dilithiumAvx512ConstsAddr() + 12), Assembler::AVX_512bit, scratch); // 2^64 mod q __ evmovdqul(perms, k0, ExternalAddress(dilithiumAvx512PermsAddr()), false, Assembler::AVX_512bit, scratch); (and `dilithiumAvx512ConstsAddr(offset)` cound take an int parameter too) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1012: > 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); > 1011: > 1012: __ movl(len, 4); Compile-time constant, why not 'unroll at compile time'? i.e. wrap this loop with `for (int len=0; len<4; len++)` instead? src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1041: > 1039: for (int i = 0; i < 4; i++) { > 1040: __ evmovdqul(Address(result, i * 64), xmm(i), Assembler::AVX_512bit); > 1041: } This is nice, compact and clean. The biggest issue I have with following this code is really with all the 'raw' registers. I would much rather prefer symbolic names, but up to you to decide style. I ended up 'annotating' this snippet, so I could understand it and confirm everything.. as with montmulEven, hope some of it can be useful to you to copy out. 
XMMRegister POLY1[] = {xmm0, xmm1, xmm2, xmm3}; XMMRegister POLY2[] = {xmm4, xmm5, xmm6, xmm7}; XMMRegister SCRATCH1[] = {xmm12, xmm13, xmm14, xmm15}; XMMRegister SCRATCH2[] = {xmm16, xmm17, xmm18, xmm19}; XMMRegister SCRATCH3[] = {xmm8, xmm9, xmm10, xmm11}; for (int i = 0; i < 4; i++) { __ evmovdqul(POLY1[i], Address(poly1, i * 64), Assembler::AVX_512bit); __ evmovdqul(POLY2[i], Address(poly2, i * 64), Assembler::AVX_512bit); } // montmulEven: inputs are in even columns and output is in odd columns // scratch3_even = poly2_even*montRSquareModQ // poly2 to montgomery domain montmulEven2(SCRATCH3[0], POLY2[0], montRSquareModQ, SCRATCH1[0], SCRATCH2[0], montQInvModR, dilithium_q, 4, _masm); for (int i = 0; i < 4; i++) { // swap even/odd; 0xB1 == 2-3-0-1 __ vpshufd(SCRATCH3[i], SCRATCH3[i], 0xB1, Assembler::AVX_512bit); } // scratch3_odd = poly1_even*scratch3_even = poly1_even*poly2_even*montRSquareModQ montmulEven2(SCRATCH3[0], POLY1[0], SCRATCH3[0], SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); for (int i = 0; i < 4; i++) { __ vpshufd(POLY1[i], POLY1[i], 0xB1, Assembler::AVX_512bit); __ vpshufd(POLY2[i], POLY2[i], 0xB1, Assembler::AVX_512bit); } // poly2_even = poly2_odd*montRSquareModQ // poly2 to montgomery domain montmulEven2(POLY2[0], POLY2[0], montRSquareModQ, SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); for (int i = 0; i < 4; i++) { __ vpshufd(POLY2[i], POLY2[i], 0xB1, Assembler::AVX_512bit); } // poly1_odd = poly1_even*poly2_even montmulEven2(POLY1[0], POLY1[0], POLY2[0], SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); for (int i = 0; i < 4; i++) { // result is scrambled between scratch3_odd and poly1_odd; unscramble __ evpermt2d(POLY1[i], perms, SCRATCH3[i], Assembler::AVX_512bit); } for (int i = 0; i < 4; i++) { __ evmovdqul(Address(result, i * 64), POLY1[i], Assembler::AVX_512bit); } With symbolic variable names, code was much easier to follow conceptually. 
Also has the side benefit of making it obvious which XMM registers are used and that there is no conflicts src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1090: > 1088: __ evpbroadcastd(xmm29, constant, Assembler::AVX_512bit); // constant multiplier > 1089: > 1090: __ movl(len, 2); Same comment here as the `generate_dilithiumNttMult_avx512` - constants can be loaded directly into XMM - len can be removed by unrolling at compile time - symbolic names could be used for registers - comments could be added ------------- PR Review: https://git.openjdk.org/jdk/pull/23860#pullrequestreview-2665370975 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999468929 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999471763 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999625933 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1992230295 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1992235625 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999712200 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999413007 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999367607 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999683384 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999686631 From vpaprotski at openjdk.org Mon Mar 17 21:49:14 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 17 Mar 2025 21:49:14 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Thu, 6 Mar 2025 17:37:33 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Accepted review comments. src/hotspot/cpu/x86/stubGenerator_x86_64.hpp line 494: > 492: address generate_sha3_implCompress(StubGenStubId stub_id); > 493: > 494: address generate_double_keccak(); you can hide internal helper functions (i.e. `montmulEven(*)`) if you wish. The trick is to add `MacroAssembler* _masm` as a parameter to the static (local) function. Its a trick I use to keep header clean, but still have plenty of helpers src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 409: > 407: __ evmovdquq(xmm29, Address(permsAndRots, 768), Assembler::AVX_512bit); > 408: __ evmovdquq(xmm30, Address(permsAndRots, 832), Assembler::AVX_512bit); > 409: __ evmovdquq(xmm31, Address(permsAndRots, 896), Assembler::AVX_512bit); Matter of taste, but I liked the compactness of montmulEven; i.e. for (i=0; i<15; i++) __ evmovdquq(xmm(17+i), Address(permsAndRots, 64*i), Assembler::AVX_512bit); src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 426: > 424: __ subl( roundsLeft, 1); > 425: > 426: __ evmovdquw(xmm5, xmm0, Assembler::AVX_512bit); Is there a pattern here; that can be 'compacted' into a loop? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983903347 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983935964 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983937154 From tschatzl at openjdk.org Tue Mar 18 16:24:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 18 Mar 2025 16:24:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v24] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 32 commits: - * factor out card table and refinement table merging into a single method - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 - * obsolete G1UpdateBufferSize G1UpdateBufferSize has previously been used to size the refinement buffers and impose a minimum limit on the number of cards per thread that need to be pending before refinement starts. The former function is now obsolete with the removal of the dirty card queues, the latter functionality has been taken over by the new diagnostic option `G1PerThreadPendingCardThreshold`. I prefer to make this a diagnostic option is better than a product option because it is something that is only necessary for some test cases to produce some otherwise unwanted behavior (continuous refinement). CSR is pending. - * more documentation on why we need to rendezvous the gc threads - Merge branch 'master' into 8342381-card-table-instead-of-dcq - * ayang review * re-add STS leaver for java thread handshake - * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. * additional verification * added some missing ResourceMarks in asserts * added variant of ArrayJuggle2 that crashes fairly quickly without these changes - * ayang review * remove unnecessary STSleaver * some more documentation around to_collection_card card color - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang - ... and 22 more: https://git.openjdk.org/jdk/compare/b025d8c2...c833bc83 ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=23 Stats: 6788 lines in 104 files changed: 2382 ins; 3476 del; 930 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Wed Mar 19 13:17:19 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 19 Mar 2025 13:17:19 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v25] In-Reply-To: References: Message-ID: <5Q9-MERAD4KIP-fzgw7JVAtC9u4L1fEFGcNkdHBvkg4=.1917bd58-a5f8-4c5c-b1f9-27b7457c6262@github.com> > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * fix IR code generation tests that change due to barrier cost changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/c833bc83..f419556e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=24 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=23-24 Stats: 5 lines in 2 files changed: 2 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Wed Mar 19 13:27:17 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 19 Mar 2025 13:27:17 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v25] In-Reply-To: <5Q9-MERAD4KIP-fzgw7JVAtC9u4L1fEFGcNkdHBvkg4=.1917bd58-a5f8-4c5c-b1f9-27b7457c6262@github.com> References: <5Q9-MERAD4KIP-fzgw7JVAtC9u4L1fEFGcNkdHBvkg4=.1917bd58-a5f8-4c5c-b1f9-27b7457c6262@github.com> Message-ID: On Wed, 19 Mar 2025 13:17:19 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
>> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix IR code generation tests that change due to barrier cost changes Commit https://github.com/openjdk/jdk/pull/23739/commits/f419556e9177ecf9fbf22e606dd6c1b850f4330f fixes the failing compiler tests that check whether the compiler emits the correct object graph. Occurs after merging with mainline that significantly reduces total barrier cost calculation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2736639357 From tschatzl at openjdk.org Thu Mar 20 09:44:07 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 20 Mar 2025 09:44:07 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v26] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
> > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * make young gen length revising independent of refinement thread * use a service task * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/f419556e..5e76a516 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=25 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=24-25 Stats: 337 lines in 12 files changed: 237 ins; 90 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Thu Mar 20 09:49:13 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 20 Mar 2025 09:49:13 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v26] In-Reply-To: References: Message-ID: On Thu, 20 Mar 2025 09:44:07 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. 
>> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * make young gen length revising independent of refinement thread > * use a service task > * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update Commit https://github.com/openjdk/jdk/pull/23739/commits/5e76a516c848e75f56e966a1ffe4115b1dce786c implements the change to make young gen length revising independent of the refinement control thread. Infrastructure to determine currently available number of bytes for allocation and determining the next time the particular task should be redone is shared. It may be distributed across a bit more methods than I would prefer, but particularly the refinement control thread wants to reuse and keep some intermediate results (to not be required to get the `Heap_lock` again basically). I did not have a good reason to make the heuristic to determine the time to the next action different for both, so they are basically the same. There is some pre-existing problem that the minimum time for re-doing the work is ~50ms. That might be too short in some cases, but then again, if you have that short of a GC interval it may not be very useful to e.g. revise young gen length anyway. 
I think with this change all current concerns are addressed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2739766880 From duke at openjdk.org Thu Mar 20 11:29:57 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 11:29:57 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: responding to review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/aa2fdf2d..2438fb5c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=06-07 Stats: 750 lines in 3 files changed: 174 ins; 447 del; 129 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From thartmann at openjdk.org Thu Mar 20 12:29:22 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 20 Mar 2025 12:29:22 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> Message-ID: On Fri, 7 Mar 2025 18:03:14 GMT, Vladimir Ivanov wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 
252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > src/hotspot/share/opto/library_call.cpp line 1963: > >> 1961: set_i_o(i_o()); >> 1962: >> 1963: uncommon_trap(Deoptimization::Reason_intrinsic, > > What about using `builtin_throw` here? (Requires some tuning on `builtin_throw` side.) How much does it affect performance? Also, passing `must_throw = true` into `uncommon_trap` may help a bit here as well. I think adapting and re-using `builtin_throw` like you described is reasonable but I let @iwanowww confirm :slightly_smiling_face: ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2005526386 From epeter at openjdk.org Thu Mar 20 13:54:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Mar 2025 13:54:07 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> Message-ID: <5rSvBeQxKuX-hhaLGygKRBi_VpALqwywgnKfK61a8j4=.258cf9ca-56fe-42a9-85b1-b6aa30f2eb5c@github.com> On Fri, 7 Mar 2025 18:03:53 GMT, Vladimir Ivanov wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 
0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Nice benchmark, Marc! @iwanowww Are you still reviewing or should I have a look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2740528216 From duke at openjdk.org Thu Mar 20 18:42:48 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 18:42:48 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v9] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: More beautification ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/2438fb5c..1cfab778 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=07-08 Stats: 307 lines in 1 file changed: 49 ins; 131 del; 127 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Thu Mar 20 20:37:25 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 20:37:25 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v10] In-Reply-To: References: Message-ID: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Fix windows build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/1cfab778..e9db09e2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=08-09 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Thu Mar 20 21:09:14 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 21:09:14 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Thu, 6 Mar 2025 19:27:12 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Accepted review comments. > > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 426: > >> 424: __ subl( roundsLeft, 1); >> 425: >> 426: __ evmovdquw(xmm5, xmm0, Assembler::AVX_512bit); > > Is there a pattern here; that can be 'compacted' into a loop? Unfortunately, no. This loop body is imported from generate_sha3_implCompress() and doubled, as explained in the comment about 15 lines above. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455877 From duke at openjdk.org Thu Mar 20 21:09:12 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 21:09:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Mon, 17 Mar 2025 19:24:52 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Made the intrinsics test separate from the pure java test. > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 58: > >> 56: >> 57: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Perms[] = { >> 58: // collect montmul results into the destination register > > same as `dilithiumAvx512Consts()`, 'magic offsets'; except here they are harder to count (eg. not clear visually what is the offset of `ntt inverse`). > > Could be split into three constant arrays to make the compiler count for us Well, it is 64 bytes per line (16 4-byte uint32_ts), not that hard :-) ... > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 140: > >> 138: __ vpmuldq(xmm(scratchReg1 + 1), xmm(inputReg12), xmm(inputReg2 + 1), Assembler::AVX_512bit); >> 139: __ vpmuldq(xmm(scratchReg1 + 2), xmm(inputReg13), xmm(inputReg2 + 2), Assembler::AVX_512bit); >> 140: __ vpmuldq(xmm(scratchReg1 + 3), xmm(inputReg14), xmm(inputReg2 + 3), Assembler::AVX_512bit); > > Another option for these four lines, to keep the style of rest of function > > int inputReg1[] = {inputReg11, inputReg12, inputReg13, inputReg14}; > for (int i = 0; i < parCnt; i++) { > __ vpmuldq(xmm(scratchReg1 + i), inputReg1[i], xmm(inputReg2 + i), Assembler::AVX_512bit); > } I have changed the whole structure instead. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 197: > >> 195: >> 196: // level 0 >> 197: montmulEven(20, 8, 29, 20, 16, 4); > > It would improve readability to know which parameter is a register, and which is a count.. i.e. > > `montmulEven(xmm20, xmm8, xmm29, xmm20, xmm16, 4);` > > (its not _that_ bad, once I remember that its always the last parameter.. but it does add to the 'mental load' one has to carry, and this code is already interesting enough) I have changed the structure, now it is clear(er) which parameter is what. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 980: > >> 978: // Dilithium multiply polynomials in the NTT domain. >> 979: // Implements >> 980: // static int implDilithiumNttMult( > > I suppose no java changes in this PR, but I notice that the inputs are all assumed to have fixed size. > > Most/all intrinsics I worked with had some sort of guard (eg `Objects.checkFromIndexSize`) right before the intrinsic java call. (It usually looks like it can be optimized away). But I notice no such guard here on the java side. These functions will not be used anywhere else and in ML_DSA.java all of the arrays passed to inrinsics are of the correct size. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1010: > >> 1008: __ vpbroadcastd(xmm31, Address(dilithiumConsts, 4), Assembler::AVX_512bit); // q >> 1009: __ vpbroadcastd(xmm29, Address(dilithiumConsts, 12), Assembler::AVX_512bit); // 2^64 mod q >> 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); > > - use of `c_rarg3` is 'clever' so probably should have a comment (ie. 
'no 3rd parameter, free register') > - Alternatively, load directly into the vector with `ExternalAddress()`; you need a scratch register (use r10) but address is close enough, it actually wont be used. Here is the disassembly I got: > > StubRoutines::dilithiumNttMult [0x00007f414fb68280, 0x00007f414fb68548] (712 bytes) > -------------------------------------------------------------------------------- > add %al,(%rax) > 0x00007f414fb68280: push %rbp > 0x00007f414fb68281: mov %rsp,%rbp > 0x00007f414fb68284: vpbroadcastd 0x18f9fe32(%rip),%zmm30 # 0x00007f4168b080c0 > 0x00007f414fb6828e: vpbroadcastd 0x18f9fe2c(%rip),%zmm31 # 0x00007f4168b080c4 > 0x00007f414fb68298: vpbroadcastd 0x18f9fe2a(%rip),%zmm29 # 0x00007f4168b080cc > 0x00007f414fb682a2: vmovdqu32 0x18f9f8d4(%rip),%zmm28 # 0x00007f4168b07b80 > ``` > > The `ExternalAddress()` calls for above assembler > ``` > const Register scratch = r10; > const XMMRegister montRSquareModQ = xmm29; > const XMMRegister montQInvModR = xmm30; > const XMMRegister dilithium_q = xmm31; > const XMMRegister perms = xmm28; > > __ vpbroadcastd(montQInvModR, ExternalAddress(dilithiumAvx512ConstsAddr()), Assembler::AVX_512bit, scratch); // q^-1 mod 2^32 > __ vpbroadcastd(dilithium_q, ExternalAddress(dilithiumAvx512ConstsAddr() + 4), Assembler::AVX_512bit, scratch); // q > __ vpbroadcastd(montRSquareModQ, ExternalAddress(dilithiumAvx512ConstsAddr() + 12), Assembler::AVX_512bit, scratch); // 2^64 mod q > __ evmovdqul(perms, k0, ExternalAddress(dilithiumAvx512PermsAddr()), false, Assembler::AVX_512bit, scratch); > > (and `dilithiumAvx512ConstsAddr(offset)` cound take an int parameter too) I added comments and changed the vpbroadcast loads to load directly from memory.l > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1012: > >> 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); >> 1011: >> 1012: __ movl(len, 4); > > Compile-time constant, why not 'unroll at compile time'? i.e. wrap this loop with `for (int len=0; len<4; len++)` instead? I have found that unrolling these loops actually hurts performance (probably an I-cache effect. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1041: > >> 1039: for (int i = 0; i < 4; i++) { >> 1040: __ evmovdqul(Address(result, i * 64), xmm(i), Assembler::AVX_512bit); >> 1041: } > > This is nice, compact and clean. The biggest issue I have with following this code is really with all the 'raw' registers. I would much rather prefer symbolic names, but up to you to decide style. > > I ended up 'annotating' this snippet, so I could understand it and confirm everything.. as with montmulEven, hope some of it can be useful to you to copy out. 
> > > XMMRegister POLY1[] = {xmm0, xmm1, xmm2, xmm3}; > XMMRegister POLY2[] = {xmm4, xmm5, xmm6, xmm7}; > XMMRegister SCRATCH1[] = {xmm12, xmm13, xmm14, xmm15}; > XMMRegister SCRATCH2[] = {xmm16, xmm17, xmm18, xmm19}; > XMMRegister SCRATCH3[] = {xmm8, xmm9, xmm10, xmm11}; > for (int i = 0; i < 4; i++) { > __ evmovdqul(POLY1[i], Address(poly1, i * 64), Assembler::AVX_512bit); > __ evmovdqul(POLY2[i], Address(poly2, i * 64), Assembler::AVX_512bit); > } > > // montmulEven: inputs are in even columns and output is in odd columns > // scratch3_even = poly2_even*montRSquareModQ // poly2 to montgomery domain > montmulEven2(SCRATCH3[0], POLY2[0], montRSquareModQ, SCRATCH1[0], SCRATCH2[0], montQInvModR, dilithium_q, 4, _masm); > for (int i = 0; i < 4; i++) { > // swap even/odd; 0xB1 == 2-3-0-1 > __ vpshufd(SCRATCH3[i], SCRATCH3[i], 0xB1, Assembler::AVX_512bit); > } > > // scratch3_odd = poly1_even*scratch3_even = poly1_even*poly2_even*montRSquareModQ > montmulEven2(SCRATCH3[0], POLY1[0], SCRATCH3[0], SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); > for (int i = 0; i < 4; i++) { > __ vpshufd(POLY1[i], POLY1[i], 0xB1, Assembler::AVX_512bit); > __ vpshufd(POLY2[i], POLY2[i], 0xB1, Assembler::AVX_512bit); > } > > // poly2_even = poly2_odd*montRSquareModQ // poly2 to montgomery domain > montmulEven2(POLY2[0], POLY2[0], montRSquareModQ, SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); > for (int i = 0; i < 4; i++) { > __ vpshufd(POLY2[i], POLY2[i], 0xB1, Assembler::AVX_512bit); > } > > // poly1_odd = poly1_even*poly2_even > montmulEven2(POLY1[0], POLY1[0], POLY2[0], SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); > for (int i = 0; i < 4; i++) { > // result is scrambled between scratch3_odd and poly1_odd; unscramble > __ evpermt2d(POLY1[i], perms, SCRATCH3[i], Assembler::AVX_512bit); > } > for (int i = 0; i < 4; i++) { > __ evmovdqul(Address(result, i *... I have rewritten it to use full montmuls (a new function) her and everywhere else. It is much easier to follow the code that way. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1090: > >> 1088: __ evpbroadcastd(xmm29, constant, Assembler::AVX_512bit); // constant multiplier >> 1089: >> 1090: __ movl(len, 2); > > Same comment here as the `generate_dilithiumNttMult_avx512` > - constants can be loaded directly into XMM > - len can be removed by unrolling at compile time > - symbolic names could be used for registers > - comments could be added Done. 
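For readers following the montmul discussion without the stub sources open, here is a minimal scalar Java sketch of the Montgomery multiplication that the vectorized helpers compute one lane at a time. The constant names mirror the `// Java:` annotations quoted above; the concrete values (q = 8380417, R = 2^32, q^-1 mod 2^32 = 58728449) are the standard ML-DSA parameters and are assumed here rather than copied from ML_DSA.java.

```
// Scalar sketch of the Montgomery multiplication behind the montmul helpers.
// Constant values are the standard ML-DSA parameters (assumed, not taken from
// ML_DSA.java); names follow the "// Java:" comments quoted in this review.
public final class MontMulSketch {
    static final int MONT_Q = 8380417;             // q = 2^23 - 2^13 + 1
    static final int MONT_R_BITS = 32;             // R = 2^32
    static final int MONT_Q_INV_MOD_R = 58728449;  // q^-1 mod 2^32

    // Returns a value congruent to a*b*R^-1 mod q; for |a*b| < 2^31*q it lies in (-q, q).
    static int montMul(int a, int b) {
        long ab = (long) a * (long) b;             // full 64-bit product
        int m = (int) ab * MONT_Q_INV_MOD_R;       // low 32 bits of ab * q^-1
        // m*q == ab (mod 2^32), so the low 32 bits cancel and the shift is exact.
        return (int) ((ab - (long) m * MONT_Q) >> MONT_R_BITS);
    }

    public static void main(String[] args) {
        // R^2 mod q is the "2^64 mod q" / montRSquareModQ constant seen in the stub;
        // multiplying by it moves a value into the Montgomery domain.
        long rModQ = (1L << MONT_R_BITS) % MONT_Q;
        int rSquareModQ = (int) ((rModQ * rModQ) % MONT_Q);
        int a = 1234567;
        int aMont = montMul(a, rSquareModQ);       // a*R mod q (up to sign)
        int aBack = montMul(aMont, 1);             // back out of the domain
        System.out.println(Math.floorMod(aBack, MONT_Q)); // prints 1234567
    }
}
```

The even/odd column juggling in the stub (montmulEven plus the vpshufd swaps in the snippet above) exists only because vpmuldq consumes the even 32-bit column of each 64-bit lane; per coefficient the arithmetic is essentially the scalar routine shown here.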
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455445 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455814 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455732 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006454991 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455529 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455662 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455178 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455086 From never at openjdk.org Fri Mar 21 05:59:09 2025 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 21 Mar 2025 05:59:09 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. Seems like a good change. ------------- Marked as reviewed by never (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23849#pullrequestreview-2704807131 From thartmann at openjdk.org Fri Mar 21 12:59:09 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 21 Mar 2025 12:59:09 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. Nice cleanup, CI changes look good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23849#pullrequestreview-2705854715 From dnsimon at openjdk.org Fri Mar 21 13:03:17 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 21 Mar 2025 13:03:17 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. Thanks for the reviews! 
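As a side note for anyone who wants to observe the new alignment from user code: the hypothetical check below compares the JVMCI field order against core reflection for a single class. It is not part of this PR or its tests; it assumes a JVM started with JVMCI enabled and the jdk.internal.vm.ci packages exported to the application (launcher flags omitted here), and it filters out statics because Class.getDeclaredFields() includes them while getInstanceFields() does not.

```
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import jdk.vm.ci.meta.MetaAccessProvider;
import jdk.vm.ci.meta.ResolvedJavaField;
import jdk.vm.ci.runtime.JVMCI;

// Hypothetical sanity check, not part of the change: compares the order of
// instance fields reported by JVMCI with the order from core reflection.
public class FieldOrderCheck {
    static class Sample { int a; static int ignored; long b; Object c; }

    public static void main(String[] args) {
        MetaAccessProvider meta =
                JVMCI.getRuntime().getHostJVMCIBackend().getMetaAccess();
        ResolvedJavaField[] jvmciFields =
                meta.lookupJavaType(Sample.class).getInstanceFields(false);

        String[] reflectionOrder = Arrays.stream(Sample.class.getDeclaredFields())
                .filter(f -> !Modifier.isStatic(f.getModifiers()))   // drop statics
                .map(Field::getName)
                .toArray(String[]::new);
        String[] jvmciOrder = Arrays.stream(jvmciFields)
                .map(ResolvedJavaField::getName)
                .toArray(String[]::new);

        // Expected to print true once the two orders are aligned by this change.
        System.out.println(Arrays.equals(reflectionOrder, jvmciOrder));
    }
}
```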
------------- PR Comment: https://git.openjdk.org/jdk/pull/23849#issuecomment-2743295318 From dnsimon at openjdk.org Fri Mar 21 13:03:17 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 21 Mar 2025 13:03:17 GMT Subject: Integrated: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. This pull request has now been integrated. Changeset: 0cb110eb Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/0cb110ebb7f8d184dd855f64c5dd7924c8202b3d Stats: 89 lines in 6 files changed: 18 ins; 32 del; 39 mod 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields Reviewed-by: yzheng, never, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/23849 From adinn at openjdk.org Fri Mar 21 14:02:17 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Fri, 21 Mar 2025 14:02:17 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v4] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 22:04:26 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Fixed mismerge. > - Merged master. > - A little cleanup > - Merged master > - removing trailing spaces > - kyber aarch64 intrinsics src/hotspot/share/opto/library_call.cpp line 7800: > 7798: const char *stubName; > 7799: assert(UseKyberIntrinsics, "need Kyber intrinsics support"); > 7800: assert(callee()->signature()->size() == 3, "kyber12To16 has 3 parameters"); Just as an aside this causes testing of a debug build to fail. The intrinsic has 4 parameters. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2007638886 From adinn at openjdk.org Fri Mar 21 14:02:18 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Fri, 21 Mar 2025 14:02:18 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v4] In-Reply-To: References: Message-ID: <54ED2n9rhYXWQuwge7bPuvPXtAmL2WpfRJFfXH__r2I=.dead1c37-4283-48a6-ad01-26fc92be30fa@github.com> On Fri, 21 Mar 2025 13:59:10 GMT, Andrew Dinn wrote: >> Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: >> >> - Fixed mismerge. >> - Merged master. >> - A little cleanup >> - Merged master >> - removing trailing spaces >> - kyber aarch64 intrinsics > > src/hotspot/share/opto/library_call.cpp line 7800: > >> 7798: const char *stubName; >> 7799: assert(UseKyberIntrinsics, "need Kyber intrinsics support"); >> 7800: assert(callee()->signature()->size() == 3, "kyber12To16 has 3 parameters"); > > Just as an aside this causes testing of a debug build to fail. The intrinsic has 4 parameters. With this value reset to 4 the ML_DSA test passes for ML_KEM on a debug build. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2007642721 From tschatzl at openjdk.org Fri Mar 21 14:20:34 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 21 Mar 2025 14:20:34 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v27] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 35 commits: - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq - * make young gen length revising independent of refinement thread * use a service task * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update - * fix IR code generation tests that change due to barrier cost changes - * factor out card table and refinement table merging into a single method - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 - * obsolete G1UpdateBufferSize G1UpdateBufferSize has previously been used to size the refinement buffers and impose a minimum limit on the number of cards per thread that need to be pending before refinement starts. The former function is now obsolete with the removal of the dirty card queues, the latter functionality has been taken over by the new diagnostic option `G1PerThreadPendingCardThreshold`. I prefer to make this a diagnostic option is better than a product option because it is something that is only necessary for some test cases to produce some otherwise unwanted behavior (continuous refinement). CSR is pending. - * more documentation on why we need to rendezvous the gc threads - Merge branch 'master' into 8342381-card-table-instead-of-dcq - * ayang review * re-add STS leaver for java thread handshake - * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. * additional verification * added some missing ResourceMarks in asserts * added variant of ArrayJuggle2 that crashes fairly quickly without these changes - ... and 25 more: https://git.openjdk.org/jdk/compare/0cb110eb...d9311047 ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=26 Stats: 7089 lines in 110 files changed: 2610 ins; 3555 del; 924 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From vlivanov at openjdk.org Fri Mar 21 22:37:14 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 21 Mar 2025 22:37:14 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> Message-ID: On Thu, 20 Mar 2025 12:26:52 GMT, Tobias Hartmann wrote: >> src/hotspot/share/opto/library_call.cpp line 1963: >> >>> 1961: set_i_o(i_o()); >>> 1962: >>> 1963: uncommon_trap(Deoptimization::Reason_intrinsic, >> >> What about using `builtin_throw` here? (Requires some tuning on `builtin_throw` side.) How much does it affect performance? Also, passing `must_throw = true` into `uncommon_trap` may help a bit here as well. > > I think adapting and re-using `builtin_throw` like you described is reasonable but I let @iwanowww confirm :slightly_smiling_face: Yes, that's basically what I had in mind. Currently, the focus of the intrinsic is on well-behaved case (overflows are **very** rare). `builtin_throw()` covers more ground and optimize for scenarios when exceptions are thrown. 
But it depends on `ciMethod::can_omit_stack_trace()` where `-XX:-OmitStackTraceInFastThrow` mode will suffer from the original problem (continuous deoptimizations), plus a round of recompilations before giving up. I suggest to improve and reuse `builtin_throw` here and add additional checks in the intrinsic to guard against problematic scenario with continuous deoptimizations. IMO it improves performance model for a wide range of use cases while addressing pathological scenarios. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2008427776 From duke at openjdk.org Sat Mar 22 12:23:47 2025 From: duke at openjdk.org (Zihao Lin) Date: Sat, 22 Mar 2025 12:23:47 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order Message-ID: 8347706: jvmciEnv.cpp has jvmci includes out of order ------------- Commit messages: - 8347706: Reorder jvmci includes in jvmciEvn.cpp Changes: https://git.openjdk.org/jdk/pull/24174/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24174&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8347706 Stats: 6 lines in 1 file changed: 3 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24174.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24174/head:pull/24174 PR: https://git.openjdk.org/jdk/pull/24174 From dnsimon at openjdk.org Sat Mar 22 14:44:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Sat, 22 Mar 2025 14:44:09 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: <4eXcUGVycNCCf3Ago-Mtf7zobSoLrZVEateUS0NpQuQ=.3512d121-7ad0-422f-9dcd-3edf4e28ec4e@github.com> On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp The change is fine but I personally think manually fixing these ordering problems is not the best use of time until there's a way to automatically enforce the expected ordering (and catch regressions). ------------- Marked as reviewed by dnsimon (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24174#pullrequestreview-2708058400 From duke at openjdk.org Sat Mar 22 14:54:06 2025 From: duke at openjdk.org (Zihao Lin) Date: Sat, 22 Mar 2025 14:54:06 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp You are right, Do we have some code check tool which help to point out the ordering issue? ------------- PR Comment: https://git.openjdk.org/jdk/pull/24174#issuecomment-2745306642 From duke at openjdk.org Sat Mar 22 14:54:06 2025 From: duke at openjdk.org (duke) Date: Sat, 22 Mar 2025 14:54:06 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp @linzihao1999 Your change (at version f0a6b84815d7a866a1561426d6b31a5e3f3b3c73) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24174#issuecomment-2745306821 From duke at openjdk.org Sat Mar 22 20:02:31 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Sat, 22 Mar 2025 20:02:31 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: - Further readability improvements. - Added asserts for array sizes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/e9db09e2..56656894 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=09-10 Stats: 228 lines in 2 files changed: 72 ins; 56 del; 100 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From vpaprotski at openjdk.org Sat Mar 22 20:05:11 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Sat, 22 Mar 2025 20:05:11 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v10] In-Reply-To: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> References: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> Message-ID: <2yP2P1VNWgQu6cWvn0_a_7LdidS71C6PWKcqGKTOHnc=.49f8ac0f-df23-4f1e-adb9-e03a3f2295b2@github.com> On Thu, 20 Mar 2025 20:37:25 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Fix windows build was going to finish the rest of the functions.. but I see you pushed an update so I better rebase! here are the pending comments I had that perhaps are no longer applicable.. (working through the ntt math..) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 121: > 119: static void montmulEven(int outputReg, int inputReg1, int inputReg2, > 120: int scratchReg1, int scratchReg2, > 121: int parCnt, MacroAssembler *_masm) { nitpick.. this could be made to look more like `montMul64()` by also taking in an array of registers. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 160: > 158: for (int i = 0; i < 4; i++) { > 159: __ vpmuldq(xmm(scratchRegs[i]), xmm(inputRegs1[i]), xmm(inputRegs2[i]), > 160: Assembler::AVX_512bit); using an array of registers, instead of array of ints would read somewhat more compact and fewer 'indirections' . i.e. static void montMul64(XMMRegister outputRegs*, XMMRegister inputRegs1*, XMMRegister inputRegs2*, ... __ vpmuldq(scratchRegs[i], inputRegs1[i], inputRegs2[i], Assembler::AVX_512bit); src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 216: > 214: // Zmm8-Zmm23 used as scratch registers > 215: // result goes to Zmm0-Zmm7 > 216: static void montMulByConst128(MacroAssembler *_masm) { wish the inputs and output register arrays were explicit.. easier to follow that way src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 230: > 228: } > 229: > 230: static void sub_add(int subResult[], int addResult[], Big fan of all these helper functions! Makes reading the top level functions way easier, thanks for refactoring! src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 279: > 277: static int xmm4_20_24[] = {4, 5, 6, 7, 20, 21, 22, 23, 24, 25, 26, 27}; > 278: static int xmm16_27[] = {16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27}; > 279: static int xmm29_29[] = {29, 29, 29, 29}; I very much like the new refactor, waaaay clearer now. 
Some 'Could Do' comments.. - I probably would have preferred 'even more symbolic' variable names (i.e. its ideal when you can match the java variable names!). Conversely, if 'forced to defend this style', these names are MUCH much easier to debug from GDB, its clear what the matching instruction is. - Not sure about it being global. It works currently, but less 'future proof'. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 645: > 643: // poly1 (int[256]) = c_rarg1 > 644: // poly2 (int[256]) = c_rarg2 > 645: static address generate_dilithiumNttMult_avx512(StubGenerator *stubgen, This would be 'nice to have', something 'lost' with the refactor.. As I was reviewing this (original) function, I was thinking, "there is nothing here _that_ specific to AVX512, mostly columnar&independent operations... This function could be made 'vector-length-independent'..." - double the loop length: int iter = vector_len==Assembler::AVX_512bit?4:8; __ movl(len, 4); -> __ movl(len, iter); - halve the register arrays.. (or keep them the same but shuffle them to make SURE the first half are in xmm0-xmm15 range) XMMRegister POLY1[] = {xmm0, xmm1, xmm12, xmm13}; XMMRegister POLY2[] = {xmm4, xmm5, xmm16, xmm17}; XMMRegister SCRATCH1[] = {xmm2, xmm3, xmm14, xmm15}; <<< here XMMRegister SCRATCH2[] = {xmm6, xmm7, xmm18, xmm19}; <<< and here XMMRegister SCRATCH3[] = {xmm8, xmm9, xmm10, xmm11}; - couple of other int constants (like the memory 'step' and such) - for assembler calls, like `evmovdqul` and `evpsubd`, need a few small new MacroAssembler helpers to instead generate VEX encoded versions (plenty of instructions already do that). - I think only the perm instruction was unique to evex (didnt really think of an alternative for AVX2.. but can be abstracted away with another helper) Anyway; not suggesting its something you do here.. but it would be convenient to leave breadcrumbs/hooks for a future update so one of us can revisit this code and add AVX2 support. e.g. `parCnt` variable was very convenient before for exactly this, now its gone... it probably could be derived in each function from vector_len but..; Its now cleaner, but also harder to 'upgrade'? Why AVX2? many of the newer (Atom/Ecore-based/EnableX86ECoreOpts) processors do not have AVX512 support, so its something I've been prioritizing recently The alternative would be to write a completely separate AVX2 implementation, but that would be a shame, not to 'just' reuse this code. ? "For fun", I had even gone and parametrized the mult function with the `vector_len` to see how it would look (almost identical... 
to the original version): static void montmulEven2(XMMRegister* outputReg, XMMRegister* inputReg1, XMMRegister* inputReg2, XMMRegister* scratchReg1, XMMRegister* scratchReg2, XMMRegister montQInvModR, XMMRegister dilithium_q, int parCnt, int vector_len, MacroAssembler* _masm) { for (int i = 0; i < parCnt; i++) { // scratch1 = (int64)input1_even*input2_even // Java: long a = (long) b * (long) c; __ vpmuldq(scratchReg1[i], inputReg1[i], inputReg2[i], vector_len); } for (int i = 0; i < parCnt; i++) { // scratch2 = int32(montQInvModR*(int32)scratch1) // Java: int aLow = (int) a; // Java: int m = MONT_Q_INV_MOD_R * aLow; // signed low product __ vpmulld(scratchReg2[i], scratchReg1[i], montQInvModR, vector_len); } for (int i = 0; i < parCnt; i++) { // scratch2 = (int64)scratch2_even*dilithium_q_even // Java: ((long)m * MONT_Q) __ vpmuldq(scratchReg2[i], scratchReg2[i], dilithium_q, vector_len); } for (int i = 0; i < parCnt; i++) { // output_odd = scratch1_odd - scratch2_odd // Java: (aHigh - (int) (("scratch2") >> MONT_R_BITS)) __ vpsubd(outputReg[i], scratchReg1[i], scratchReg2[i], vector_len); } } ------------- PR Review: https://git.openjdk.org/jdk/pull/23860#pullrequestreview-2708079853 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008809855 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008811046 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008811541 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008811704 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008808110 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008824304 From vpaprotski at openjdk.org Sat Mar 22 20:05:12 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Sat, 22 Mar 2025 20:05:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Thu, 20 Mar 2025 21:06:30 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 58: >> >>> 56: >>> 57: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Perms[] = { >>> 58: // collect montmul results into the destination register >> >> same as `dilithiumAvx512Consts()`, 'magic offsets'; except here they are harder to count (eg. not clear visually what is the offset of `ntt inverse`). >> >> Could be split into three constant arrays to make the compiler count for us > > Well, it is 64 bytes per line (16 4-byte uint32_ts), not that hard :-) ... Ha! I didn't realize it was 16 per line.. ran out of fingers while counting!!! :) 'works for me, as long as its a "premeditated" decision' >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 980: >> >>> 978: // Dilithium multiply polynomials in the NTT domain. >>> 979: // Implements >>> 980: // static int implDilithiumNttMult( >> >> I suppose no java changes in this PR, but I notice that the inputs are all assumed to have fixed size. >> >> Most/all intrinsics I worked with had some sort of guard (eg `Objects.checkFromIndexSize`) right before the intrinsic java call. (It usually looks like it can be optimized away). But I notice no such guard here on the java side. > > These functions will not be used anywhere else and in ML_DSA.java all of the arrays passed to inrinsics are of the correct size. Works for me; just thought I would point it out, so its a 'premeditated' decision. 
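Since Java-side guards come up a couple of times in this review, the sketch below shows the kind of `Objects.checkFromIndexSize` check being referred to. It is purely illustrative: the method and constant names are made up, and, as stated above, ML_DSA.java instead relies on its callers always passing arrays of the expected size.

```
import java.util.Objects;

// Illustrative only: the usual shape of a bounds guard placed in front of an
// intrinsified helper. Names here are hypothetical, not from ML_DSA.java.
final class GuardSketch {
    private static final int COEFF_COUNT = 256;

    static void nttMult(int[] product, int[] coeffs1, int[] coeffs2) {
        // Throws IndexOutOfBoundsException if an array is shorter than expected;
        // the reviewer notes such checks usually look like they can be optimized away.
        Objects.checkFromIndexSize(0, COEFF_COUNT, product.length);
        Objects.checkFromIndexSize(0, COEFF_COUNT, coeffs1.length);
        Objects.checkFromIndexSize(0, COEFF_COUNT, coeffs2.length);
        implNttMult(product, coeffs1, coeffs2);   // the intrinsic candidate (hypothetical)
    }

    private static void implNttMult(int[] product, int[] coeffs1, int[] coeffs2) {
        // pure-Java fallback would live here
    }
}
```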
>> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1012: >> >>> 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); >>> 1011: >>> 1012: __ movl(len, 4); >> >> Compile-time constant, why not 'unroll at compile time'? i.e. wrap this loop with `for (int len=0; len<4; len++)` instead? > > I have found that unrolling these loops actually hurts performance (probably an I-cache effect. Interesting; I keep on having to re-train my intuition, thanks for the data ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008806159 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008805574 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008805113 From duke at openjdk.org Sat Mar 22 20:23:25 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Sat, 22 Mar 2025 20:23:25 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v5] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Fixed bad assertion. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23663/files - new: https://git.openjdk.org/jdk/pull/23663/files/7e9b3d84..9ec9a6cd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From vpaprotski at openjdk.org Sat Mar 22 20:42:09 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Sat, 22 Mar 2025 20:42:09 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 20:02:31 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Further readability improvements. > - Added asserts for array sizes src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 119: > 117: static address dilithiumAvx512PermsAddr() { > 118: return (address) dilithiumAvx512Perms; > 119: } Hear me out.. ... enums!! enum nttPermOffset { montMulPermsIdx = 0, nttL4PermsIdx = 64, nttL5PermsIdx = 192, nttL6PermsIdx = 320, nttL7PermsIdx = 448, nttInvL0PermsIdx = 704, nttInvL1PermsIdx = 832, nttInvL2PermsIdx = 960, nttInvL3PermsIdx = 1088, nttInvL4PermsIdx = 1216, }; static address dilithiumAvx512PermsAddr(nttPermOffset offset) { return (address) dilithiumAvx512Perms + offset; } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008900858 From duke at openjdk.org Sun Mar 23 00:39:11 2025 From: duke at openjdk.org (Zihao Lin) Date: Sun, 23 Mar 2025 00:39:11 GMT Subject: Integrated: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp This pull request has now been integrated. 
Changeset: df9210e6 Author: Zihao Lin Committer: SendaoYan URL: https://git.openjdk.org/jdk/commit/df9210e6578acd53384ee1ac06601510c9a52696 Stats: 6 lines in 1 file changed: 3 ins; 3 del; 0 mod 8347706: jvmciEnv.cpp has jvmci includes out of order Reviewed-by: dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/24174 From syan at openjdk.org Sun Mar 23 01:16:20 2025 From: syan at openjdk.org (SendaoYan) Date: Sun, 23 Mar 2025 01:16:20 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp > /sponsor Sorry, did not noticed that this PR no satisfied more than 24 hours... ------------- PR Comment: https://git.openjdk.org/jdk/pull/24174#issuecomment-2745952316 From dnsimon at openjdk.org Sun Mar 23 11:56:11 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Sun, 23 Mar 2025 11:56:11 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 14:51:16 GMT, Zihao Lin wrote: > You are right, Do we have some code check tool which help to point out the ordering issue? Not as far as I know but it should not be too hard to come up with. I've opened https://bugs.openjdk.org/browse/JDK-8352645 to have this considered. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24174#issuecomment-2746168751 From jbhateja at openjdk.org Mon Mar 24 02:41:14 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 24 Mar 2025 02:41:14 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 20:02:31 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Further readability improvements. > - Added asserts for array sizes src/hotspot/cpu/x86/vm_version_x86.cpp line 1252: > 1250: // Currently we only have them for AVX512 > 1251: #ifdef _LP64 > 1252: if (supports_evex() && supports_avx512bw()) { supports_evex check looks redundant. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2009379308 From vpaprotski at openjdk.org Mon Mar 24 15:19:22 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 24 Mar 2025 15:19:22 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: <_TOBoO4cMQpw4sgzIpNpQZ2w5wDgezKQZLe314DQ7zo=.813b81bf-ecc0-4f75-a0d6-fbb13dde594e@github.com> On Sat, 22 Mar 2025 20:02:31 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Further readability improvements. > - Added asserts for array sizes I still need to have a look at the sha3 changes, but I think I am done with the most complex part of the review. This was a really interesting bit of code to review! src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 270: > 268: } > 269: > 270: static void loadPerm(int destinationRegs[], Register perms, `replXmm`? i.e. 
this function is replicating (any) Xmm register, not just perm?.. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 327: > 325: // > 326: // > 327: static address generate_dilithiumAlmostNtt_avx512(StubGenerator *stubgen, Similar comments as to `generate_dilithiumAlmostInverseNtt_avx512` - similar comment about the 'pair-wise' operation, updating `[j]` and `[j+l]` at a time.. - somehow had less trouble following the flow through registers here, perhaps I am getting used to it. FYI, ended renaming some as: // xmm16_27 = Temp1 // xmm0_3 = Coeffs1 // xmm4_7 = Coeffs2 // xmm8_11 = Coeffs3 // xmm12_15 = Coeffs4 = Temp2 // xmm16_27 = Scratch src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 421: > 419: for (int i = 0; i < 8; i += 2) { > 420: __ evpermi2d(xmm(i / 2 + 12), xmm(i), xmm(i + 1), Assembler::AVX_512bit); > 421: } Wish there was a more 'abstract' way to arrange this, so its obvious from the shape of the code what registers are input/outputs (i.e. and use the register arrays). Even though its just 'elementary index operations' `i/2 + 16` is still 'clever'. Couldnt think of anything myself though (same elsewhere in this function for the table permutes). src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 509: > 507: // coeffs (int[256]) = c_rarg0 > 508: // zetas (int[256]) = c_rarg1 > 509: static address generate_dilithiumAlmostInverseNtt_avx512(StubGenerator *stubgen, Done with this function; Perhaps the 'permute table' is a common vector-algorithm pattern, but this is really clever! Some general comments first, rest inline. - The array names for registers helped a lot. And so did the new helper functions! - The java version of this code is quite intimidating to vectorize.. 3D loop, with geometric iteration variables.. and the literature is even more intimidating (discrete convolutions which I havent touched in two decades, ffts, ntts, etc.) Here is my attempt at a comment to 'un-scare' the next reader, though feel free to reword however you like. The core of the (Java) loop is this 'pair-wise' operation: int a = coeffs[j]; int b = coeffs[j + offset]; coeffs[j] = (a + b); coeffs[j + offset] = montMul(a - b, -MONT_ZETAS_FOR_NTT[m]); There are 8 'levels' (0-7); ('levels' are equivalent to (unrolling) the outer (Java) loop) At each level, the 'pair-wise-offset' doubles (2^l: 1, 2, 4, 8, 16, 32, 64, 128). To vectorize this Java code, observe that at each level, REGARDLESS the offset, half the operations are the SUM, and the other half is the montgomery MULTIPLICATION (of the pair-difference with a constant). At each level, one 'just' has to shuffle the coefficients, so that SUMs and MULTIPLICATIONs line up accordingly. Otherwise, this pattern is 'lightly similar' to a discrete convolution (compute integral/summation of two functions at every offset) - I still would prefer (more) symbolic register names.. I wouldn't hold my approval over it so won't object if nobody else does, but register numbers are harder to 'see' through the flow. I ended up search/replacing/'annotating' to make it easier on myself to follow the flow of data: // xmm8_11 = Perms1 // xmm12_15 = Perms2 // xmm16_27 = Scratch // xmm0_3 = CoeffsPlus // xmm4_7 = CoeffsMul // xmm24_27 = CoeffsMinus (overlaps with Scratch) (I made a similar comment, but I think it is now hidden after the last refactor) - would prefer to see the helper functions to get ALL the registers passed explicitly (i.e. currently `montMulPerm`, `montQInvModR`, `dilithium_q`, `xmm29`, are implicit.). 
As a general rule, I've tried to set up all the registers up at the 'entry' function (`generate_dilithium*` in this case) and from there on, use symbolic names. Not always reasonable, but what I've grown used to see? src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 554: > 552: for (int i = 0; i < 8; i += 2) { > 553: __ evpermi2d(xmm(i / 2 + 8), xmm(i), xmm(i + 1), Assembler::AVX_512bit); > 554: __ evpermi2d(xmm(i / 2 + 12), xmm(i), xmm(i + 1), Assembler::AVX_512bit); Took a bit to unscramble the flow, so a comment needed? Purpose 'fairly obvious' once I got the general shape of the level/algorithm (as per my top-level comment) but something like "shuffle xmm0-7 into xmm8-15"?
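To make the pair-wise description above a bit more concrete, a hypothetical scalar Java sketch of the pattern follows: one Montgomery multiplication (mirroring the steps annotated in montmulEven2 earlier in the thread) plus the 8-level pair-wise loop. This is not the actual ML_DSA.java source; the class name, the constants and the zeta-table indexing are assumptions, and the montMul result is only partially reduced.

import java.math.BigInteger;

public final class NttLevelSketch {
    static final int N = 256;
    static final int MONT_Q = 8380417;   // assumed Dilithium modulus q
    static final int MONT_R_BITS = 32;   // R = 2^32
    static final int MONT_Q_INV_MOD_R =  // q^-1 mod 2^32, computed rather than hard-coded
            BigInteger.valueOf(MONT_Q).modInverse(BigInteger.ONE.shiftLeft(MONT_R_BITS)).intValue();

    // Returns a value congruent to b * c * R^-1 (mod q), only partially reduced.
    static int montMul(int b, int c) {
        long a = (long) b * (long) c;        // full 64-bit product
        int m = MONT_Q_INV_MOD_R * (int) a;  // low-half multiply, wraps mod 2^32
        // a and (long) m * MONT_Q agree in their low 32 bits, so their difference is an
        // exact multiple of 2^32; only the high halves survive the subtraction.
        return (int) (a >> MONT_R_BITS) - (int) (((long) m * MONT_Q) >> MONT_R_BITS);
    }

    // The 8 levels of pair-wise operations: at every level half the work is a SUM and
    // the other half a Montgomery MULTIPLICATION of the pair difference by a constant.
    static void pairwiseLevels(int[] coeffs, int[] zetas) {
        int m = 0;
        for (int level = 0; level < 8; level++) {
            int offset = 1 << level;                    // 1, 2, 4, ..., 128
            for (int start = 0; start < N; start += 2 * offset) {
                for (int j = start; j < start + offset; j++) {
                    int a = coeffs[j];
                    int b = coeffs[j + offset];
                    coeffs[j] = a + b;
                    coeffs[j + offset] = montMul(a - b, zetas[m]);
                }
                m++;                                    // zeta indexing here is an assumption
            }
        }
    }
}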
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 572: > 570: load4Xmms(xmm4_7, zetas, 512, _masm); > 571: sub_add(xmm24_27, xmm0_3, xmm8_11, xmm12_15, _masm); > 572: montMul64(xmm4_7, xmm24_27, xmm4_7, xmm16_27, _masm); From my annotated version, levels 1-4, fairly 'straightforward': // level 1 replXmm(Perms1, perms, nttInvL1PermsIdx, _masm); replXmm(Perms2, perms, nttInvL1PermsIdx + 64, _masm); for (int i = 0; i < 4; i++) { __ evpermi2d(xmm(Perms1[i]), xmm(CoeffsPlus[i]), xmm(CoeffsMul[i]), Assembler::AVX_512bit); __ evpermi2d(xmm(Perms2[i]), xmm(CoeffsPlus[i]), xmm(CoeffsMul[i]), Assembler::AVX_512bit); } load4Xmms(CoeffsMul, zetas, 512, _masm); sub_add(CoeffsMinus, CoeffsPlus, Perms1, Perms2, _masm); montMul64(CoeffsMul, CoeffsMinus, CoeffsMul, Scratch, _masm); src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 613: > 611: montMul64(xmm4_7, xmm24_27, xmm4_7, xmm16_27, _masm); > 612: > 613: // level 5 "// No shuffling for level 5 and 6; can just rearrange full registers" src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 656: > 654: for (int i = 0; i < 8; i++) { > 655: __ evpsubd(xmm(i), k0, xmm(i + 8), xmm(i), false, Assembler::AVX_512bit); > 656: } Fairly clean as is, but could also be two sub_add calls, I think (you have to swap order of add/sub in the helper, to be able to clobber `xmm(i)`.. or swap register usage downstream, so perhaps not.. but would be cleaner) sub_add(CoeffsPlus, Scratch, Perms1, CoeffsPlus, _masm); sub_add(CoeffsMul, &Scratch[4], Perms2, CoeffsMul, _masm); If nothing else, would have preferred to see the use of the register array variables src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 660: > 658: store4Xmms(coeffs, 0, xmm16_19, _masm); > 659: store4Xmms(coeffs, 4 * XMMBYTES, xmm20_23, _masm); > 660: montMulByConst128(_masm); Would prefer explicit parameters here. But I think this could also be two `montMul64` calls? montMul64(xmm0_3, xmm0_3, xmm29_29, Scratch, _masm); montMul64(xmm4_7, xmm4_7, xmm29_29, Scratch, _masm); (I think there is one other use of `montMulByConst128` where the same applies; then you could delete both `montMulByConst128` and `montmulEven`) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 871: > 869: __ evpaddd(xmm5, k0, xmm1, barrettAddend, false, Assembler::AVX_512bit); > 870: __ evpaddd(xmm6, k0, xmm2, barrettAddend, false, Assembler::AVX_512bit); > 871: __ evpaddd(xmm7, k0, xmm3, barrettAddend, false, Assembler::AVX_512bit); Fairly 'straightforward' transcription of the java code.. no comments from me. At first glance using `xmm0_3`, `xmm4_7`, etc. might have been a good idea, but you only save one line per 4x group. (Unless you have one big loop, but I suspect that gives you worse performance? Is that something you tried already? Might be worth it otherwise..) src/java.base/share/classes/sun/security/provider/ML_DSA.java line 1418: > 1416: int twoGamma2, int multiplier) { > 1417: assert (input.length == ML_DSA_N) && (lowPart.length == ML_DSA_N) > 1418: && (highPart.length == ML_DSA_N); I wrote this test to verify java-to-intrinsic correspondence. Might be good to include it (and add the other 4 intrinsics).
This is very similar to all my other *Fuzz* tests I've been adding for my own intrinsics (and you made this test FAR easier to write by breaking out the java implementation; need to 'copy' that pattern myself) import java.util.Arrays; import java.util.Random; import java.lang.invoke.MethodHandle; import java.lang.invoke.MethodHandles; import java.lang.reflect.Field; import java.lang.reflect.Method; import java.lang.reflect.Constructor; public class ML_DSA_Intrinsic_Test { public static void main(String[] args) throws Exception { MethodHandles.Lookup lookup = MethodHandles.lookup(); Class kClazz = Class.forName("sun.security.provider.ML_DSA"); Constructor constructor = kClazz.getDeclaredConstructor( int.class); constructor.setAccessible(true); Method m = kClazz.getDeclaredMethod("mlDsaNttMultiply", int[].class, int[].class, int[].class); m.setAccessible(true); MethodHandle mult = lookup.unreflect(m); m = kClazz.getDeclaredMethod("implDilithiumNttMultJava", int[].class, int[].class, int[].class); m.setAccessible(true); MethodHandle multJava = lookup.unreflect(m); Random rnd = new Random(); long seed = rnd.nextLong(); rnd.setSeed(seed); //Note: it might be useful to increase this number during development of new intrinsics final int repeat = 1000000; int[] coeffs1 = new int[ML_DSA_N]; int[] coeffs2 = new int[ML_DSA_N]; int[] prod1 = new int[ML_DSA_N]; int[] prod2 = new int[ML_DSA_N]; try { for (int i = 0; i < repeat; i++) { run(prod1, prod2, coeffs1, coeffs2, mult, multJava, rnd, seed, i); } System.out.println("Fuzz Success"); } catch (Throwable e) { System.out.println("Fuzz Failed: " + e); } } private static final int ML_DSA_N = 256; public static void run(int[] prod1, int[] prod2, int[] coeffs1, int[] coeffs2, MethodHandle mult, MethodHandle multJava, Random rnd, long seed, int i) throws Exception, Throwable { for (int j = 0; j This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) at java.base/java.lang.Thread.run(Thread.java:1447) Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optimizer.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_FrameMap.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_RangeCheckElimination.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_InstructionPrinter.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/bcEscapeAnalyzer.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciInstance.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciEnv.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciUtilities.inline.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciMethod.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciUtilities.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciEnv.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciCallSite.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/bcEscapeAnalyzer.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciReplay.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciInstanceKlass.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilationMemoryStatistic.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilationFailureInfo.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilationPolicy.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/directivesParser.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compileBroker.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/directivesParser.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilerDirectives.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/methodMatcher.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilationMemoryStatistic.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compileTask.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/disassembler.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/oopMap.inline.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmciRuntime.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmci.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmciCompiler.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmci.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmciEnv.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmciJavaClasses.cpp Note that 
non-space characters after the closing " or > of an include statement can be used to prevent re-ordering of the include. For example: #include "e.hpp" #include "d.hpp" #include "c.hpp" // do not reorder #include "b.hpp" #include "a.hpp" will be reformatted as: #include "d.hpp" #include "e.hpp" #include "c.hpp" // do not reorder #include "a.hpp" #include "b.hpp" at SortIncludes.main(SortIncludes.java:190) at TestIncludesAreSorted.main(TestIncludesAreSorted.java:75) ... 4 more JavaTest Message: Test threw exception: java.lang.RuntimeException This PR includes a [commit](https://github.com/openjdk/jdk/pull/24247/commits/a76d4f98c7e6074b4745c1c1791fe605e352d79f) with ordering suppression comments for some files I discovered needed it while playing around in #24180 . This PR replaces #24180. ------------- Commit messages: - sort includes in subset of HotSpot sources and added a test to keep them sorted - added tool to sort includes - do not reorder certain includes Changes: https://git.openjdk.org/jdk/pull/24247/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8352645 Stats: 396 lines in 53 files changed: 335 ins; 54 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From shade at openjdk.org Wed Mar 26 11:35:25 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 26 Mar 2025 11:35:25 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support Message-ID: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. 
Additional testing: - [x] Linux x86_64 server fastdebug, `tier1` - [ ] Linux x86_64 server fastdebug, `all` ------------- Commit messages: - Leftover - Fix Changes: https://git.openjdk.org/jdk/pull/24250/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24250&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351155 Stats: 545 lines in 48 files changed: 0 ins; 511 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/24250.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24250/head:pull/24250 PR: https://git.openjdk.org/jdk/pull/24250 From stefank at openjdk.org Wed Mar 26 12:27:09 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Wed, 26 Mar 2025 12:27:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 09:21:59 GMT, Doug Simon wrote: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... Thanks for updating to use the lower-case comparison. I wonder if a small tweak can fix the extra blank lines I complained about in the other PR. The tool removes the extra blank line we have in our .inline.hpp. From the Style Guide: All .inline.hpp files should include their corresponding .hpp file as the first include line with a blank line separating it from the rest of the include lines. Declarations needed by other files should be put in the .hpp file, and not in the .inline.hpp file. This rule exists to resolve problems with circular dependencies between .inline.hpp files. I think this needs to be fixed, otherwise people will start to remove these. src/hotspot/share/compiler/oopMap.inline.hpp line 29: > 27: > 28: #include "compiler/oopMap.hpp" > 29: This blank line should not be removed. test/hotspot/jtreg/sources/SortIncludes.java line 77: > 75: blankLines = List.of(""); > 76: } > 77: result.addAll(blankLines); If this line is removed you don't get the extra blank lines I mentioned in the previous PR. It also removes the extra blank line that you get inserted into oopMap.inline.hpp before the INCLUDE_JVMCI block. ------------- Changes requested by stefank (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2716954567 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2014026694 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2014025793 From stefank at openjdk.org Wed Mar 26 13:43:16 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Wed, 26 Mar 2025 13:43:16 GMT Subject: RFR: 8352645: Add tool support to check order of includes In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 12:19:14 GMT, Stefan Karlsson wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. 
`"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > test/hotspot/jtreg/sources/SortIncludes.java line 77: > >> 75: blankLines = List.of(""); >> 76: } >> 77: result.addAll(blankLines); > > If this line is removed you don't get the extra blank lines I mentioned in the previous PR. It also removes the extra blank line that you get inserted into oopMap.inline.hpp before the INCLUDE_JVMCI block. Or, rather if the code is changed to: if (!userIncludes.isEmpty() && !sysIncludes.isEmpty()) { result.add(""); } result.addAll(sysIncludes); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2014172537 From dnsimon at openjdk.org Wed Mar 26 14:23:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 26 Mar 2025 14:23:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. 
`"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... Doug Simon has updated the pull request incrementally with one additional commit since the last revision: drop extra blank lines and preserve rule for first include in .inline.hpp files ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/62779478..18e2a1d6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=00-01 Stats: 52 lines in 4 files changed: 40 ins; 5 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From duke at openjdk.org Wed Mar 26 15:23:47 2025 From: duke at openjdk.org (Zihao Lin) Date: Wed, 26 Mar 2025 15:23:47 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make Message-ID: This patch remove slice parameter from LoadNode::make Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 Hi team, I am new, I'd appreciate any guidance. Thank a lot! 
------------- Commit messages: - 8344116: C2: remove slice parameter from LoadNode::make Changes: https://git.openjdk.org/jdk/pull/24258/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8344116 Stats: 54 lines in 13 files changed: 3 ins; 14 del; 37 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From duke at openjdk.org Wed Mar 26 15:43:30 2025 From: duke at openjdk.org (Zihao Lin) Date: Wed, 26 Mar 2025 15:43:30 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make [v2] In-Reply-To: References: Message-ID: <6NXNfV1dqzZxpogva4dsv0kxkAQtJlgmLnSHvgZm5YA=.461d9a09-1e23-4acd-8230-0840348183ef@github.com> > This patch remove slice parameter from LoadNode::make > > Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 > > Hi team, I am new, I'd appreciate any guidance. Thank a lot! Zihao Lin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'openjdk:master' into 8344116 - 8344116: C2: remove slice parameter from LoadNode::make ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24258/files - new: https://git.openjdk.org/jdk/pull/24258/files/27df4a01..f4ef46dc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=00-01 Stats: 34071 lines in 1200 files changed: 1990 ins; 30272 del; 1809 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From stefank at openjdk.org Wed Mar 26 15:46:18 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Wed, 26 Mar 2025 15:46:18 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. 
>> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files Thanks for doing the last two fixes. I think this looks good now, but I need a bit more time to do some deeper verification. Thanks! ------------- PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2717748962 From kvn at openjdk.org Wed Mar 26 19:02:13 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Mar 2025 19:02:13 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support In-Reply-To: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Wed, 26 Mar 2025 10:11:25 GMT, Aleksey Shipilev wrote: > C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. > > For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. > > For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` Good. ------------- Marked as reviewed by kvn (Reviewer). 
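As an aside for readers following the blank-line discussion in this thread, a simplified, hypothetical sketch of the separator logic Stefan suggests is shown below. The class and method names are invented and the real SortIncludes.java differs in its details; the sketch only illustrates emitting a single blank line between the user ("...") and sys (<...>) include groups when both are present.

import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class IncludeBlockSketch {
    // Re-emit a sorted include block: user includes first, then sys includes,
    // separated by exactly one blank line when both groups are non-empty.
    static List<String> emit(SortedSet<String> userIncludes, SortedSet<String> sysIncludes) {
        List<String> result = new ArrayList<>(userIncludes);
        if (!userIncludes.isEmpty() && !sysIncludes.isEmpty()) {
            result.add("");
        }
        result.addAll(sysIncludes);
        return result;
    }

    public static void main(String[] args) {
        SortedSet<String> user = new TreeSet<>(List.of("#include \"b.hpp\"", "#include \"a.hpp\""));
        SortedSet<String> sys = new TreeSet<>(List.of("#include <new>"));
        emit(user, sys).forEach(System.out::println);
    }
}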
PR Review: https://git.openjdk.org/jdk/pull/24250#pullrequestreview-2718342841 From vlivanov at openjdk.org Wed Mar 26 19:13:12 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 26 Mar 2025 19:13:12 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support In-Reply-To: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Wed, 26 Mar 2025 10:11:25 GMT, Aleksey Shipilev wrote: > C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. > > For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. > > For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24250#pullrequestreview-2718363625 From kbarrett at openjdk.org Thu Mar 27 06:14:21 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 27 Mar 2025 06:14:21 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files Changes requested by kbarrett (Reviewer). src/hotspot/share/ci/ciUtilities.inline.hpp line 29: > 27: > 28: #include "ci/ciUtilities.hpp" > 29: Extra blank line not removed? src/hotspot/share/ci/ciUtilities.inline.hpp line 32: > 30: #include "runtime/interfaceSupport.inline.hpp" > 31: > 32: Extra blank line inserted? src/hotspot/share/compiler/compilationFailureInfo.cpp line 35: > 33: #include "compiler/compilationFailureInfo.hpp" > 34: #include "compiler/compileTask.hpp" > 35: #ifdef COMPILER2 Conditional includes are supposed to follow unconditional in a section. Out of scope for this PR? src/hotspot/share/compiler/disassembler.hpp line 36: > 34: #include "utilities/macros.hpp" > 35: > 36: Extra blank line inserted? test/hotspot/jtreg/sources/SortIncludes.java line 39: > 37: > 38: public class SortIncludes { > 39: private static final String INCLUDE_LINE = "^ *#include *(<[^>]+>|\"[^\"]+\") *$\\n"; There are files that have spaces between the `#` and `include`. I'm kind of inclined to suggest we fix those at some point (not in this PR). But the regex here needs to allow for that possibility, and perhaps (eventually) complain about such. test/hotspot/jtreg/sources/SortIncludes.java line 115: > 113: } > 114: > 115: /// Processes the C++ source file in `path` to sort its include statements. If we want to apply this to hotspot jtreg test code, then C source files also come into the picture. test/hotspot/jtreg/sources/SortIncludes.java line 153: > 151: > 152: /// Processes the C++ source files in `paths` to check if their include statements are sorted. > 153: /// Include statements with any non-space characters after the closing `"` or `>` will not Perhaps this should be mentioned in the style guide? 
------------- PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2719852021 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015721384 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015718606 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015723999 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015725371 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015706803 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015712545 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015714360 From kbarrett at openjdk.org Thu Mar 27 06:30:09 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 27 Mar 2025 06:30:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files Probably we want to eventually apply this to gtests, but there might be additional rules there. The include of unittest.hpp is (usually) last, and there may be (or may have been) a technical reason for that. Applying it to jtreg test support files could also introduce some challenges. Or at least discover a lot of non-conforming files. We might eventually want a mechanism for excluding directories, in addition to an inclusion list (that might eventually be "all"). These kinds of things can be followups once we have the basic mechanism in place. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2756881833 From bkilambi at openjdk.org Thu Mar 27 08:13:11 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 27 Mar 2025 08:13:11 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Tue, 25 Feb 2025 19:45:31 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Hello @shqking @theRealAph , sincere apologies for the delay in addressing the review comments. I am planning on uploading a patch soon addressing all review comments. Thank you ! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23748#issuecomment-2757083553 From dnsimon at openjdk.org Thu Mar 27 08:21:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 08:21:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 05:56:55 GMT, Kim Barrett wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> drop extra blank lines and preserve rule for first include in .inline.hpp files > > test/hotspot/jtreg/sources/SortIncludes.java line 39: > >> 37: >> 38: public class SortIncludes { >> 39: private static final String INCLUDE_LINE = "^ *#include *(<[^>]+>|\"[^\"]+\") *$\\n"; > > There are files that have spaces between the `#` and `include`. I'm kind of inclined to suggest we fix those > at some point (not in this PR). But the regex here needs to allow for that possibility, and perhaps (eventually) > complain about such. Since there are no such cases in the files processed in this PR, I'd suggest not adding support for them. They can be fixed in follow up PRs as the relevant directories are added to `TestIncludesAreSorted.HOTSPOT_SOURCES_TO_CHECK`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015912061 From shade at openjdk.org Thu Mar 27 08:45:49 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 27 Mar 2025 08:45:49 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support [v2] In-Reply-To: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: > C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. > > For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. > > For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. 
> > Additional testing: > - [x] Linux x86_64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Minor leftover ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24250/files - new: https://git.openjdk.org/jdk/pull/24250/files/376c5ad8..88e4589c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24250&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24250&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24250.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24250/head:pull/24250 PR: https://git.openjdk.org/jdk/pull/24250 From stefank at openjdk.org Thu Mar 27 08:46:09 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 08:46:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files

I ran the latest script over the HotSpot source and see that it messes up corner cases with our platform includes.

diff --git a/src/hotspot/cpu/aarch64/continuationEntry_aarch64.inline.hpp b/src/hotspot/cpu/aarch64/continuationEntry_aarch64.inline.hpp
index df4d3957239..e8816767a96 100644
--- a/src/hotspot/cpu/aarch64/continuationEntry_aarch64.inline.hpp
+++ b/src/hotspot/cpu/aarch64/continuationEntry_aarch64.inline.hpp
@@ -25,10 +25,9 @@
 #ifndef CPU_AARCH64_CONTINUATIONENTRY_AARCH64_INLINE_HPP
 #define CPU_AARCH64_CONTINUATIONENTRY_AARCH64_INLINE_HPP

-#include "runtime/continuationEntry.hpp"
-
 #include "code/codeCache.hpp"
 #include "oops/method.inline.hpp"
+#include "runtime/continuationEntry.hpp"
 #include "runtime/frame.inline.hpp"
 #include "runtime/registerMap.hpp"

The includes are:

.hpp --------------> _aarch64.hpp
 ^  ^
 |  |
 |  +------------------+
 |                     |
.inline.hpp -------> _aarch64.inline.hpp

So, continuationEntry.hpp acts like the .hpp file for continuationEntry_aarch64.inline.hpp. Unfortunately, we don't have a fully consistent way to write our platform includes, so I don't know how to codify this in a tool without breaking things.

------------- PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2720267338 From stefank at openjdk.org Thu Mar 27 08:46:10 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 08:46:10 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: <0I5RGRwY9sT2TJDoc1RjzTOck5evkm4-iO2Int7Imqg=.d3d3abce-e771-455f-9de6-cae4781434a1@github.com> On Thu, 27 Mar 2025 06:10:37 GMT, Kim Barrett wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> drop extra blank lines and preserve rule for first include in .inline.hpp files

> src/hotspot/share/compiler/disassembler.hpp line 36:
>
>> 34: #include "utilities/macros.hpp"
>> 35:
>> 36:
>
> Extra blank line inserted?

This seems to be leftovers from an earlier run. If I run the tool on this file it doesn't add this blank line.
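To illustrate the convention under discussion with a concrete, hypothetical example: in a platform `.inline.hpp` file the corresponding shared `.hpp` is expected to stay first, and — assuming the tool keeps ignoring include lines that carry trailing characters, which is what the workaround mentioned later in this thread relies on — a trailing comment can pin it there:

#include "runtime/continuationEntry.hpp" // must stay first; not re-ordered by SortIncludes

#include "code/codeCache.hpp"
#include "oops/method.inline.hpp"
#include "runtime/frame.inline.hpp"
#include "runtime/registerMap.hpp"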
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015915647 From kbarrett at openjdk.org Thu Mar 27 09:07:22 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 27 Mar 2025 09:07:22 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 08:18:58 GMT, Doug Simon wrote: >> test/hotspot/jtreg/sources/SortIncludes.java line 39: >> >>> 37: >>> 38: public class SortIncludes { >>> 39: private static final String INCLUDE_LINE = "^ *#include *(<[^>]+>|\"[^\"]+\") *$\\n"; >> >> There are files that have spaces between the `#` and `include`. I'm kind of inclined to suggest we fix those >> at some point (not in this PR). But the regex here needs to allow for that possibility, and perhaps (eventually) >> complain about such. > > Since there are no such cases in the files processed in this PR, I'd suggest not adding support for them. They can be fixed in follow up PRs as the relevant directories are added to `TestIncludesAreSorted.HOTSPOT_SOURCES_TO_CHECK`. The regex needs to detect that case eventually anyway, so I think it should be done now. Either we allow that case, in which case the regex must match to work properly where they are present. Or we forbid that case, in which case the regex must match to detect future mistakes even after we've cleaned up existing usage. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016008497 From stefank at openjdk.org Thu Mar 27 09:17:09 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 09:17:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: <1W8bUhsbNfCXWzdT6QxlegrTNqYo-wxbQHhpzifIFK4=.71382d20-0999-4385-b285-e34936be436c@github.com> On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files I verified that adding a comment to the end of the `#include "runtime/continuationEntry.hpp"` line leaves that file intact, so I think that is a good enough workaround for the problematic platform includes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757283373 From stefank at openjdk.org Thu Mar 27 09:24:17 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 09:24:17 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 09:04:45 GMT, Kim Barrett wrote: >> Since there are no such cases in the files processed in this PR, I'd suggest not adding support for them. They can be fixed in follow up PRs as the relevant directories are added to `TestIncludesAreSorted.HOTSPOT_SOURCES_TO_CHECK`. > > The regex needs to detect that case eventually anyway, so I think it should be done now. Either we allow that > case, in which case the regex must match to work properly where they are present. Or we forbid that case, > in which case the regex must match to detect future mistakes even after we've cleaned up existing usage. To me it seems like a small adjustment fixes this Suggestion: private static final String INCLUDE_LINE = "^ *# *include *(<[^>]+>|"[^"]+") *$\\n"; ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016040674 From dnsimon at openjdk.org Thu Mar 27 09:43:08 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 09:43:08 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 06:26:43 GMT, Kim Barrett wrote: > Probably we want to eventually apply this to gtests, but there might be additional rules there. The include of unittest.hpp is (usually) last, and there may be (or may have been) a technical reason for that. > > Applying it to jtreg test support files could also introduce some challenges. Or at least discover a lot of non-conforming files. 
We might eventually want a mechanism for excluding directories, in addition to an inclusion list (that might eventually be "all"). > > These kinds of things can be followups once we have the basic mechanism in place. I would suggest someone open issue(s) for follow up enhancements to the tool. I think having something in place now and incrementally improving it and adjusting it for all the special cases makes most sense. > src/hotspot/share/compiler/compilationFailureInfo.cpp line 35: > >> 33: #include "compiler/compilationFailureInfo.hpp" >> 34: #include "compiler/compileTask.hpp" >> 35: #ifdef COMPILER2 > > Conditional includes are supposed to follow unconditional in a section. > Out of scope for this PR? Yep. From the PR description: The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > test/hotspot/jtreg/sources/SortIncludes.java line 115: > >> 113: } >> 114: >> 115: /// Processes the C++ source file in `path` to sort its include statements. > > If we want to apply this to hotspot jtreg test code, then C source files also come into the picture. I think the tool will need to be updated to handle C source files. At that point, the comment should be generalized. > test/hotspot/jtreg/sources/SortIncludes.java line 153: > >> 151: >> 152: /// Processes the C++ source files in `paths` to check if their include statements are sorted. >> 153: /// Include statements with any non-space characters after the closing `"` or `>` will not > > Perhaps this should be mentioned in the style guide? Done. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757350491 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016078724 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016077938 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016078194 From dnsimon at openjdk.org Thu Mar 27 09:49:38 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 09:49:38 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. 
I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... Doug Simon has updated the pull request incrementally with four additional commits since the last revision: - allow spaces between `#` and `include` - moved some logic out of SortIncludes into TestIncludesAreSorted - removed extra blank lines - update style guide with advice on how to label includes that should not be re-ordered ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/18e2a1d6..cada0df4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=01-02 Stats: 117 lines in 6 files changed: 60 ins; 29 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From dnsimon at openjdk.org Thu Mar 27 09:49:38 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 09:49:38 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 09:20:58 GMT, Stefan Karlsson wrote: >> The regex needs to detect that case eventually anyway, so I think it should be done now. Either we allow that >> case, in which case the regex must match to work properly where they are present. Or we forbid that case, >> in which case the regex must match to detect future mistakes even after we've cleaned up existing usage. > > To me it seems like a small adjustment fixes this > Suggestion: > > private static final String INCLUDE_LINE = "^ *# *include *(<[^>]+>|"[^"]+") *$\\n"; Done. 
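For reference, a minimal, self-contained sketch of what the adjusted pattern accepts and rejects. The class name and test strings are made up for illustration; the pattern is the suggested one with Java string escaping restored, and the trailing `\n` is dropped here because single lines are matched:

import java.util.regex.Pattern;

public class IncludeLineCheck {
    // The suggested pattern, with the quotes escaped for a Java string literal.
    private static final Pattern INCLUDE_LINE =
            Pattern.compile("^ *# *include *(<[^>]+>|\"[^\"]+\") *$");

    public static void main(String[] args) {
        // Both spellings now match; a line with a trailing comment does not,
        // so such includes are left where they are.
        System.out.println(INCLUDE_LINE.matcher("#include \"runtime/frame.hpp\"").matches());   // true
        System.out.println(INCLUDE_LINE.matcher("#  include <new>").matches());                  // true
        System.out.println(INCLUDE_LINE.matcher("#include \"a.hpp\" // keep first").matches()); // false
    }
}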
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016091219 From stefank at openjdk.org Thu Mar 27 10:13:15 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 10:13:15 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 09:49:38 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... 
> > Doug Simon has updated the pull request incrementally with four additional commits since the last revision: > > - allow spaces between `#` and `include` > - moved some logic out of SortIncludes into TestIncludesAreSorted > - removed extra blank lines > - update style guide with advice on how to label includes that should not be re-ordered I'm happy with the capabilities of the tool now and think that it is good enough to include and promote to HotSpot devs. One questions is where to put the tool? I don't think the test directory is the best place. Maybe somewhere in `src/utils/`. There is a tools dir here `src/utils/src/build/tools/` but I don't know if it is appropriate to put it there. Maybe @magicus knows a good place for this? A couple of nits: 1) jcheck fails because of whitespaces 2) The /// style comments is a style I haven't encountered before. ------------- PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2720671629 From dnsimon at openjdk.org Thu Mar 27 10:39:13 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 10:39:13 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 10:10:07 GMT, Stefan Karlsson wrote: > A couple of nits: > > 1. jcheck fails because of whitespaces > 2. The /// style comments is a style I haven't encountered before. I fixed the whitespaces. I can convert the `///` comments if you want - no strong opinion. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757548404 From dnsimon at openjdk.org Thu Mar 27 10:47:14 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 10:47:14 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 09:49:38 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. 
>> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > Doug Simon has updated the pull request incrementally with four additional commits since the last revision: > - allow spaces between `#` and `include` - moved some logic out of SortIncludes into TestIncludesAreSorted - removed extra blank lines - update style guide with advice on how to label includes that should not be re-ordered

I just noticed that TestIncludesAreSorted is not run by GHA. How about we move `test/hotspot/jtreg/sources` into `tier1_common`:

diff --git a/test/hotspot/jtreg/TEST.groups b/test/hotspot/jtreg/TEST.groups
index 71b9e497e25..62b11e73aa0 100644
--- a/test/hotspot/jtreg/TEST.groups
+++ b/test/hotspot/jtreg/TEST.groups
@@ -139,6 +139,7 @@ serviceability_ttf_virtual = \
   -serviceability/jvmti/negative

 tier1_common = \
+  sources \
   sanity/BasicVMTest.java \
   gtest/GTestWrapper.java \
   gtest/LockStackGtests.java \
@@ -619,16 +620,12 @@ tier1_serviceability = \
   -serviceability/sa/TestJmapCore.java \
   -serviceability/sa/TestJmapCoreMetaspace.java

-tier1_sources = \
-  sources
-
 tier1 = \
   :tier1_common \
   :tier1_compiler \
   :tier1_gc \
   :tier1_runtime \
   :tier1_serviceability \
-  :tier1_sources

 tier2 = \
   :hotspot_tier2_runtime \

------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757570734 From stefank at openjdk.org Thu Mar 27 11:16:07 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 11:16:07 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 10:36:38 GMT, Doug Simon wrote: > > A couple of nits: > > > > 1. jcheck fails because of whitespaces > > 2. The /// style comments is a style I haven't encountered before. > > I fixed the whitespaces. I can convert the `///` comments if you want - no strong opinion. Maybe someone else knows the preferred style for this? I don't think we need to block the integration because of this. If someone comes late with the proper comment style, we'll update it in a separate PR.
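For readers who have not seen them: `///` is the Markdown documentation comment syntax (JEP 467). A small, hypothetical example in the spirit of the tool's comments — the class, method and stream pipeline below are illustrative only, not the tool's actual implementation, which uses a SortedSet as described above:

import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

public class MarkdownDocCommentExample {

    /// Sorts include lines using lowercased keys, so that `_` sorts before
    /// letters, and drops exact duplicates.
    static List<String> sortIncludes(Stream<String> lines) {
        return lines.distinct()
                    .sorted(Comparator.comparing((String s) -> s.toLowerCase(Locale.ROOT)))
                    .toList();
    }

    public static void main(String[] args) {
        // Prints g1_globals.hpp before g1Allocator.hpp, matching the sort order
        // described in the style guide update.
        sortIncludes(Stream.of(
                "#include \"gc/g1/g1Allocator.hpp\"",
                "#include \"gc/g1/g1_globals.hpp\"",
                "#include \"gc/g1/g1Allocator.hpp\""))
            .forEach(System.out::println);
    }
}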
------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757648203 From duke at openjdk.org Thu Mar 27 12:40:29 2025 From: duke at openjdk.org (Zihao Lin) Date: Thu, 27 Mar 2025 12:40:29 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make [v3] In-Reply-To: References: Message-ID: > This patch remove slice parameter from LoadNode::make > > Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 > > Hi team, I am new, I'd appreciate any guidance. Thank a lot! Zihao Lin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'openjdk:master' into 8344116 - Merge branch 'openjdk:master' into 8344116 - 8344116: C2: remove slice parameter from LoadNode::make ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24258/files - new: https://git.openjdk.org/jdk/pull/24258/files/f4ef46dc..08c1a382 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=01-02 Stats: 3892 lines in 94 files changed: 1545 ins; 2033 del; 314 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From dnsimon at openjdk.org Thu Mar 27 13:21:55 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 13:21:55 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v4] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... Doug Simon has updated the pull request incrementally with two additional commits since the last revision: - moved test/hotspot/jtreg/sources into tier1_common - remove trailing spaces ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/cada0df4..93770e71 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=02-03 Stats: 7 lines in 2 files changed: 1 ins; 4 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From ihse at openjdk.org Thu Mar 27 13:38:19 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 27 Mar 2025 13:38:19 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: <8Fj5Ui2g5uviJDr4x5rqLaoODjxjfxY_SWqbxXojSlI=.47630310-20d6-40e3-aab9-bce915bc04ad@github.com> On Thu, 27 Mar 2025 10:10:07 GMT, Stefan Karlsson wrote: > One questions is where to put the tool? I don't think the test directory is the best place. Maybe somewhere in src/utils/. There is a tools dir here src/utils/src/build/tools/ but I don't know if it is appropriate to put it there. Maybe @magicus knows a good place for this? I would actually recommend just the `bin` directory. This is , after all, intended to be run as a simple script (remember, it was originally a python script), in a similar vein to the already existing `blessed-modifier-order.sh` script. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2758060262 From ihse at openjdk.org Thu Mar 27 13:38:20 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 27 Mar 2025 13:38:20 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 11:13:27 GMT, Stefan Karlsson wrote: > The /// style comments is a style I haven't encountered before. 
This is for the new markdown comments. Personally, I very much prefer them and have been looking forward to these for a long time. But I don't know if we have any policy for or against those in the JDK. Using them in a script like this seems fine to me, at any rate. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2758064273 From dnsimon at openjdk.org Thu Mar 27 14:11:07 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 14:11:07 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v5] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... 
Doug Simon has updated the pull request incrementally with one additional commit since the last revision: moved error message into UnsortedIncludesException ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/93770e71..c93e6646 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=03-04 Stats: 13 lines in 2 files changed: 8 ins; 4 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From dnsimon at openjdk.org Thu Mar 27 14:11:07 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 14:11:07 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 10:44:48 GMT, Doug Simon wrote: > I just noticed that TestIncludesAreSorted is not run by GHA. How about we move `test/hotspot/jtreg/sources` into `tier1_common`: I went ahead and pushed this change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2758187724 From dnsimon at openjdk.org Thu Mar 27 14:17:34 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 14:17:34 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: <8Fj5Ui2g5uviJDr4x5rqLaoODjxjfxY_SWqbxXojSlI=.47630310-20d6-40e3-aab9-bce915bc04ad@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> <8Fj5Ui2g5uviJDr4x5rqLaoODjxjfxY_SWqbxXojSlI=.47630310-20d6-40e3-aab9-bce915bc04ad@github.com> Message-ID: On Thu, 27 Mar 2025 13:34:02 GMT, Magnus Ihse Bursie wrote: > I would actually recommend just the bin directory. Fine by me but I'm not sure how to then use `bin/SortIncludes.java` in `test/hotspot/jtreg/sources/TestIncludesAreSorted.java`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2758217497 From thartmann at openjdk.org Thu Mar 27 15:36:18 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 27 Mar 2025 15:36:18 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: References: Message-ID: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> On Wed, 26 Mar 2025 09:16:17 GMT, Marc Chevalier wrote: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. > > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. 
The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. > > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc @rwestrel Should have a look at this :) Please add an IR framework test that verifies that layout helper checks are optimized. src/hotspot/share/opto/type.cpp line 3684: > 3682: } > 3683: > 3684: bool TypeInterfaces::has_non_array_interface() const { What about using `TypeAryPtr::_array_interfaces->contains(_interfaces);` instead? ------------- Changes requested by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24245#pullrequestreview-2722219539 PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2016955402 From shade at openjdk.org Thu Mar 27 17:08:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 27 Mar 2025 17:08:44 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support [v2] In-Reply-To: References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: <02tkGNzza6MfOkCxeymt8tcXm3bSCPiv6GBCkwjcLs4=.4d351dd7-82b9-46c2-ada6-facf807f70a2@github.com> On Thu, 27 Mar 2025 08:45:49 GMT, Aleksey Shipilev wrote: >> C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. >> >> For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. >> >> For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Minor leftover Need a quick re-review after a minor leftover removal. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24250#issuecomment-2758809458 From vlivanov at openjdk.org Thu Mar 27 17:59:25 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 27 Mar 2025 17:59:25 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support [v2] In-Reply-To: References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Thu, 27 Mar 2025 08:45:49 GMT, Aleksey Shipilev wrote: >> C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. >> >> For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. 
>> >> For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Minor leftover Marked as reviewed by vlivanov (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/24250#pullrequestreview-2722925670 From shade at openjdk.org Thu Mar 27 18:14:34 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 27 Mar 2025 18:14:34 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support [v2] In-Reply-To: References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Thu, 27 Mar 2025 08:45:49 GMT, Aleksey Shipilev wrote: >> C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. >> >> For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. >> >> For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Minor leftover Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/24250#issuecomment-2759003292 From shade at openjdk.org Thu Mar 27 18:14:34 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 27 Mar 2025 18:14:34 GMT Subject: Integrated: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support In-Reply-To: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Wed, 26 Mar 2025 10:11:25 GMT, Aleksey Shipilev wrote: > C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. > > For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. > > For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` This pull request has now been integrated. 
Changeset: b73663a2 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/b73663a2b4fe7049fc0990c1a1e51221640b4e29 Stats: 547 lines in 48 files changed: 0 ins; 513 del; 34 mod 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support Reviewed-by: vlivanov, kvn ------------- PR: https://git.openjdk.org/jdk/pull/24250 From mchevalier at openjdk.org Fri Mar 28 09:41:11 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 28 Mar 2025 09:41:11 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> References: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> Message-ID: On Thu, 27 Mar 2025 15:33:31 GMT, Tobias Hartmann wrote: >> If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. >> >> In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. >> >> This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. >> >> The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. >> >> Tested with tier1..3, hs-precheckin-comp and hs-comp-stress >> >> Thanks, >> Marc > > src/hotspot/share/opto/type.cpp line 3684: > >> 3682: } >> 3683: >> 3684: bool TypeInterfaces::has_non_array_interface() const { > > What about using `TypeAryPtr::_array_interfaces->contains(_interfaces);` instead? Almost! return !TypeAryPtr::_array_interfaces->contains(this); Contains is about TypeInterfaces, that is set of interfaces. So I just need to check that `this` is not a sub-set of array interfaces. That should do it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2018248760 From mchevalier at openjdk.org Fri Mar 28 09:53:13 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 28 Mar 2025 09:53:13 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: References: Message-ID: On Wed, 26 Mar 2025 09:16:17 GMT, Marc Chevalier wrote: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. > > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. 
The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. > > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc I'm not sure how to write such an IR test. I'm looking at [TestArrayGuardWithInterfaces.java](https://github.com/openjdk/jdk/blob/3e9a7a4aed168422473c941ff5626d0d65aaadfa/test/hotspot/jtreg/compiler/intrinsics/TestArrayGuardWithInterfaces.java). I see the graphs of `test1` before and after, and the new one is smaller. But the nodes used are pretty much the same, or they don't feel clearly linked to interface checking: there is `DecodeNKlass` or `AddP`, but it doesn't seem obvious without having the graph under the eyes that it actually checks something meaningful. There are also less `If` (2 instead of 3), but once again, the test seems brittle. I also see that There is no more `Return` only `Halt` since we can now prove the function cannot return normally. But on the graph of `test2` ends with two `Halt`: traps everywhere, even if there are paths on which `test2` doesn't throw. So the lack of `Return` doesn't sound very robust. Overall, not sure what a good test would be. I can write a test that would not pass before and pass now, but I'm not convinced they would reliably catch regression, and that they won't break for unrelated reasons. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24245#issuecomment-2760760131 From duke at openjdk.org Fri Mar 28 13:03:48 2025 From: duke at openjdk.org (Zihao Lin) Date: Fri, 28 Mar 2025 13:03:48 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make [v4] In-Reply-To: References: Message-ID: > This patch remove slice parameter from LoadNode::make > > Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 > > Hi team, I am new, I'd appreciate any guidance. Thank a lot! Zihao Lin has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: 8344116: C2: remove slice parameter from LoadNode::make ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24258/files - new: https://git.openjdk.org/jdk/pull/24258/files/08c1a382..f6b2fbec Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=02-03 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From thartmann at openjdk.org Fri Mar 28 19:27:15 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 28 Mar 2025 19:27:15 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: References: Message-ID: On Wed, 26 Mar 2025 09:16:17 GMT, Marc Chevalier wrote: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. 
> > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. > > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc Right, I was hoping that there would be some other suitable users of `GraphKit::get_layout_helper` that would now be folded but all current uses either trap or don't handle both arrays and non-arrays (and therefore wouldn't fold). So I agree, adding an IR framework test does not make sense. The existing test is sufficient. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24245#issuecomment-2762234661 From thartmann at openjdk.org Fri Mar 28 19:27:15 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 28 Mar 2025 19:27:15 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: References: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> Message-ID: On Fri, 28 Mar 2025 09:38:19 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/type.cpp line 3684: >> >>> 3682: } >>> 3683: >>> 3684: bool TypeInterfaces::has_non_array_interface() const { >> >> What about using `TypeAryPtr::_array_interfaces->contains(_interfaces);` instead? > > Almost! > > return !TypeAryPtr::_array_interfaces->contains(this); > > Contains is about TypeInterfaces, that is set of interfaces. So I just need to check that `this` is not a sub-set of array interfaces. That should do it. Now I'm confused, isn't this what I proposed? I didn't check the exact syntax, I just wondered if the `TypeInterfaces::contains` method couldn't be used instead of adding a new method. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2019219027 From vlivanov at openjdk.org Fri Mar 28 21:49:25 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 28 Mar 2025 21:49:25 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v2] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Wed, 26 Mar 2025 08:33:58 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 
22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Use builtin_throw > - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code > - More exhaustive bench > - Limit inlining of math Exact operations in case of too many deopts Thanks, Marc. It looks a bit too convoluted to me. IMO an unconditional call to `builtin_throw`, plus `too_many_traps` check should do the job. Do I miss something important here? src/hotspot/share/opto/graphKit.hpp line 279: > 277: // The JVMS must allow the bytecode to be re-executed via an uncommon trap. > 278: // If `exception_object` is nullptr, the exception to throw will be guessed based on `reason` > 279: void builtin_throw(Deoptimization::DeoptReason reason, ciInstance* exception_object = nullptr); Please, introduce a new overload instead. I suggest to extract Deoptimization::DeoptReason -> ciInstance mapping into a helper method and turn `void builtin_throw(Deoptimization::DeoptReason reason)` into a wrapper: void GraphKit::builtin_throw(Deoptimization::DeoptReason reason) { builtin_throw(reason, exception_on_deopt(reason)); } src/hotspot/share/opto/library_call.cpp line 2035: > 2033: > 2034: if (use_builtin_throw) { > 2035: builtin_throw(Deoptimization::Reason_intrinsic, env()->ArithmeticException_instance()); I suggest to unconditionally call `builtin_throw()`. It should handle `uncommon_trap` case as well. What makes sense is to ensure that `builtin_throw()` doesn't change deoptimization reason. It can be implemented with an extra argument to new `GraphKit::builtin_throw` overload (e.g., `bool allow_deopt_reason_none`). src/hotspot/share/opto/library_call.cpp line 2054: > 2052: // instead of bailing out on intrinsic or potentially deopting, let's do that! > 2053: use_builtin_throw = true; > 2054: } else if (too_many_traps(Deoptimization::Reason_intrinsic)) { Why `too_many_traps(Deoptimization::Reason_intrinsic)` check is not enough here? 
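To make the failure mode concrete, here is a minimal, self-contained Java sketch of the overflow-heavy pattern that the `*Overflow` benchmarks quoted above exercise. The class name, loop bound and random choice are illustrative only and not taken from the actual JMH benchmark; the point is that the `ArithmeticException` path of `Math.addExact` is hot, which is what repeatedly fires the intrinsic's uncommon trap before this fix.

```
import java.util.concurrent.ThreadLocalRandom;

// Illustrative workload only: Math.addExact overflows on roughly half of the
// iterations, so the ArithmeticException path is hot. Before the change
// discussed in this thread, the C2 intrinsic would hit its uncommon trap on
// that path over and over, causing the deoptimization/recompilation cycle.
public class AddExactOverflowLoop {
    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            // a is one of MAX_VALUE, MAX_VALUE-1, MAX_VALUE-2, MAX_VALUE-3
            int a = Integer.MAX_VALUE - ThreadLocalRandom.current().nextInt(4);
            try {
                sum += Math.addExact(a, 2); // overflows when a > Integer.MAX_VALUE - 2
            } catch (ArithmeticException e) {
                sum++; // exceptional path, taken about half the time
            }
        }
        System.out.println(sum);
    }
}
```

The same shape applies to the other `Math.*Exact` methods covered by the benchmark.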
------------- PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2726864135 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2019432922 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2019444895 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2019449996 From dnsimon at openjdk.org Fri Mar 28 22:24:40 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 28 Mar 2025 22:24:40 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... 
Doug Simon has updated the pull request incrementally with one additional commit since the last revision: convert Windows path to Unix path ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/c93e6646..921e3251 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=04-05 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From duke at openjdk.org Sat Mar 29 07:19:21 2025 From: duke at openjdk.org (Zihao Lin) Date: Sat, 29 Mar 2025 07:19:21 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make [v5] In-Reply-To: References: Message-ID: > This patch remove slice parameter from LoadNode::make > > Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 > > Hi team, I am new, I'd appreciate any guidance. Thank a lot! Zihao Lin has updated the pull request incrementally with two additional commits since the last revision: - Fix build - Fix test failed ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24258/files - new: https://git.openjdk.org/jdk/pull/24258/files/f6b2fbec..a1924c35 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=03-04 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From mchevalier at openjdk.org Mon Mar 31 06:49:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 06:49:50 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References: Message-ID: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. > > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. 
> > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: not reinventing the wheel ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24245/files - new: https://git.openjdk.org/jdk/pull/24245/files/a77c397c..daaaf9ae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24245&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24245&range=00-01 Stats: 13 lines in 1 file changed: 0 ins; 12 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/24245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24245/head:pull/24245 PR: https://git.openjdk.org/jdk/pull/24245 From mchevalier at openjdk.org Mon Mar 31 06:49:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 06:49:50 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> Message-ID: <48D8vzTXZDKtZxAMTDdo9ggjWnWn7XNjs6rZqwuDZxc=.d833c90c-09da-4167-aec9-aba8b9e523b5@github.com> On Fri, 28 Mar 2025 19:24:22 GMT, Tobias Hartmann wrote: >> Almost! >> >> return !TypeAryPtr::_array_interfaces->contains(this); >> >> Contains is about TypeInterfaces, that is set of interfaces. So I just need to check that `this` is not a sub-set of array interfaces. That should do it. > > Now I'm confused, isn't this what I proposed? I didn't check the exact syntax, I just wondered if the `TypeInterfaces::contains` method couldn't be used instead of adding a new method. Yes, totally! It's just a detail difference. But there is another question: whether we still want `has_non_array_interface` has a wrapper for this call with a more explicit name, or if we simply inline your suggestion on the callsite of `has_non_array_interface`. I tend toward the first, I like explicit names, and I suspect it might be useful in more than one place, but not a strong opinion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2020483393 From mchevalier at openjdk.org Mon Mar 31 07:54:14 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 07:54:14 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v2] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Wed, 26 Mar 2025 08:33:58 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 
229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Use builtin_throw > - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code > - More exhaustive bench > - Limit inlining of math Exact operations in case of too many deopts Actually, yes, there is a reason I've made it so weird (and I agree it's pretty convoluted). `builtin_throw` kicks in if `too_many_traps(reason)` is true (and another case, but it might not apply): https://github.com/openjdk/jdk/blob/59629f88e6fad9c1ff91be4cfea83f78f0ea503c/src/hotspot/share/opto/graphKit.cpp#L540-L555 If `treat_throw_as_hot` is false (so before too many traps) it just ends up as a `uncommon_trap` with `Action_maybe_recompile` action. That is fine at first. But later, we would like `builtin_throw` to do its job, but it can only do if if https://github.com/openjdk/jdk/blob/59629f88e6fad9c1ff91be4cfea83f78f0ea503c/src/hotspot/share/opto/graphKit.cpp#L563 which is not `too_many_traps(reason)`. Which means that: - if we don't bailout intrinsics on `too_many_traps(reason)` we may be in the same situation as in the bug, with deopt cycles, in the situation where `builtin_throw` doesn't do it's job (for instance `method()->can_omit_stack_trace()` is false) - if we bailout intrincs on `too_many_traps(reason)`, then `builtin_throw` never get a hot enough throw that it can speed up, and we have the same situation as my first version, before you suggested `builtin_throw` (with performances similar for C2 and C1). In other words, we need `too_many_traps(reason)` to be reached to have `builtin_throw` start to have a change to do something, but it might not, and in this case, we need to bailout from intrinsics otherwise, we will repeatedly deopt. So, when `too_many_traps(reason)` is true, we have two options: either we give it to `builtin_throw` or we bailout. And to avoid the deopt cycles, we must know in advance if `builtin_throw` will do its job or just default to an `uncommon_trap` again (in which case, bailing out is better). This is why I extracted the condition for `builtin_throw` into `builtin_throw_applies`: so that intrinsic can decide what is best to do. Some of your suggestions are still relevant tho! 
I'll apply them. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2765414288 From mchevalier at openjdk.org Mon Mar 31 08:05:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 08:05:50 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v3] In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... 
Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: guess_exception_from_deopt_reason out of builtin_throw ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23916/files - new: https://git.openjdk.org/jdk/pull/23916/files/9372228d..41d7a1d4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=01-02 Stats: 49 lines in 2 files changed: 21 ins; 25 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23916.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916 PR: https://git.openjdk.org/jdk/pull/23916 From mchevalier at openjdk.org Mon Mar 31 08:33:42 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 08:33:42 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v4] In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains six commits: - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code - guess_exception_from_deopt_reason out of builtin_throw - Use builtin_throw - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code - More exhaustive bench - Limit inlining of math Exact operations in case of too many deopts ------------- Changes: https://git.openjdk.org/jdk/pull/23916/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=03 Stats: 759 lines in 6 files changed: 723 ins; 27 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23916.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916 PR: https://git.openjdk.org/jdk/pull/23916 From mchevalier at openjdk.org Mon Mar 31 09:37:08 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 09:37:08 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: <48D8vzTXZDKtZxAMTDdo9ggjWnWn7XNjs6rZqwuDZxc=.d833c90c-09da-4167-aec9-aba8b9e523b5@github.com> References: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> <48D8vzTXZDKtZxAMTDdo9ggjWnWn7XNjs6rZqwuDZxc=.d833c90c-09da-4167-aec9-aba8b9e523b5@github.com> Message-ID: On Mon, 31 Mar 2025 06:46:51 GMT, Marc Chevalier wrote: >> Now I'm confused, isn't this what I proposed? I didn't check the exact syntax, I just wondered if the `TypeInterfaces::contains` method couldn't be used instead of adding a new method. > > Yes, totally! It's just a detail difference. But there is another question: whether we still want `has_non_array_interface` has a wrapper for this call with a more explicit name, or if we simply inline your suggestion on the callsite of `has_non_array_interface`. I tend toward the first, I like explicit names, and I suspect it might be useful in more than one place, but not a strong opinion. For now, I just replaced the implementation of `has_non_array_interface`. If one feels even keeping the method is premature factorization, I can easily inline it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2020704570 From bkilambi at openjdk.org Mon Mar 31 09:54:14 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 31 Mar 2025 09:54:14 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: <5_o8l6NUDH-laA-OZT9wvJ5-AR9vs2tUwXf0jVzB9T4=.0ec06331-95ca-45a2-bd1f-14cea2150b81@github.com> On Tue, 25 Feb 2025 19:45:31 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Hello, I would not be able to respond to comments until the next couple months or so due to some urgent tasks at work. Until then, I'd move this PR to draft status so that it would not be closed due to lack of activity. Thank you for the review! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23748#issuecomment-2765729618 From ihse at openjdk.org Mon Mar 31 10:05:19 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 31 Mar 2025 10:05:19 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Fri, 28 Mar 2025 22:24:40 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > convert Windows path to Unix path Hm... I know the source code is bundled with the test image, but I'm not 100% sure if it just includes `src`, or if the entire top-level source is included. I'll need to check that, including what is the best way to get a proper reference to the top-level directory from a test. 
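One possible approach, sketched below under the assumption that the jtreg-provided `test.src` system property points into the checked-out test sources (this is not the actual `TestIncludesAreSorted` code): walk upwards until a directory containing `src/hotspot` is found, and skip gracefully if the sources are not there.

```
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only (not the actual TestIncludesAreSorted logic): walk up from the
// directory jtreg passes in the "test.src" property until we find a directory
// that contains src/hotspot, i.e. the top of the checked-out sources.
public class FindRepoRoot {
    public static void main(String[] args) {
        Path dir = Path.of(System.getProperty("test.src", ".")).toAbsolutePath();
        while (dir != null && !Files.isDirectory(dir.resolve("src").resolve("hotspot"))) {
            dir = dir.getParent();
        }
        if (dir == null) {
            // e.g. running from a test image that does not bundle the sources
            System.out.println("HotSpot sources not found; skipping");
        } else {
            System.out.println("repository root: " + dir);
        }
    }
}
```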
------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2765754142 From duke at openjdk.org Mon Mar 31 11:14:14 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 11:14:14 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: On Mon, 24 Mar 2025 02:38:37 GMT, Jatin Bhateja wrote: >> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: >> >> - Further readability improvements. >> - Added asserts for array sizes > > src/hotspot/cpu/x86/vm_version_x86.cpp line 1252: > >> 1250: // Currently we only have them for AVX512 >> 1251: #ifdef _LP64 >> 1252: if (supports_evex() && supports_avx512bw()) { > > supports_evex check looks redundant. These are checks for two different feature bits: CPU_AVX512F and CPU_AVX512BW. Are you saying that the latter implies the former in every implementation of the spec? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2020853815 From roland at openjdk.org Mon Mar 31 11:49:09 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 31 Mar 2025 11:49:09 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References: Message-ID: On Mon, 31 Mar 2025 06:49:50 GMT, Marc Chevalier wrote: >> If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. >> >> In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. >> >> This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. >> >> The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. >> >> Tested with tier1..3, hs-precheckin-comp and hs-comp-stress >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > not reinventing the wheel src/hotspot/share/opto/memnode.cpp line 2214: > 2212: if (tkls->offset() == in_bytes(Klass::layout_helper_offset()) && > 2213: tkls->isa_instklassptr() && // not directly typed as an array > 2214: !tkls->is_instklassptr()->might_be_an_array() // not the supertype of all T[] (java.lang.Object) or has an interface that is not Serializable or Cloneable Could we do the same by using `TypeKlassPtr::maybe_java_subtype_of(TypeAryKlassPtr::BOTTOM)` and define a `TypeAryKlassPtr::BOTTOM` to be a static field for the `array_interfaces`? AFAICT, `TypeKlassPtr::maybe_java_subtype_of()` already covers that case so it would avoid some logic duplication. Also in the test above, maybe you could simplify the test a little but by removing `tkls->isa_instklassptr()`? 
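As a stand-alone illustration of the type-system fact behind this change (illustration only, not part of the patch): array types implement exactly `Cloneable` and `java.io.Serializable`, so a receiver known to implement any other interface can never be an array and the array guard can be folded.

```
import java.io.Serializable;

// Stand-alone illustration: every array type is a subtype of exactly Cloneable
// and java.io.Serializable, and of no other interface, so an object known to
// implement e.g. Comparable cannot be an array.
public class ArrayInterfaceDemo {
    public static void main(String[] args) {
        Object ints = new int[4];
        Object strs = new String[4];

        System.out.println(ints instanceof Cloneable);    // true
        System.out.println(ints instanceof Serializable); // true
        System.out.println(strs instanceof Cloneable);    // true
        System.out.println(strs instanceof Serializable); // true
        System.out.println(ints instanceof Comparable);   // false
        System.out.println(strs instanceof Comparable);   // false
    }
}
```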
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2020893305 From duke at openjdk.org Mon Mar 31 14:28:20 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:20 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: <_TOBoO4cMQpw4sgzIpNpQZ2w5wDgezKQZLe314DQ7zo=.813b81bf-ecc0-4f75-a0d6-fbb13dde594e@github.com> References: <_TOBoO4cMQpw4sgzIpNpQZ2w5wDgezKQZLe314DQ7zo=.813b81bf-ecc0-4f75-a0d6-fbb13dde594e@github.com> Message-ID: On Mon, 24 Mar 2025 15:16:20 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: >> >> - Further readability improvements. >> - Added asserts for array sizes > > I still need to have a look at the sha3 changes, but I think I am done with the most complex part of the review. This was a really interesting bit of code to review! @vpaprotsk , thanks a lot for the very thorough review! > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 270: > >> 268: } >> 269: >> 270: static void loadPerm(int destinationRegs[], Register perms, > > `replXmm`? i.e. this function is replicating (any) Xmm register, not just perm?.. Since I am only using it for permutation describers, I thought this way it is easier to follow what is happening. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 327: > >> 325: // >> 326: // >> 327: static address generate_dilithiumAlmostNtt_avx512(StubGenerator *stubgen, > > Similar comments as to `generate_dilithiumAlmostInverseNtt_avx512` > > - similar comment about the 'pair-wise' operation, updating `[j]` and `[j+l]` at a time.. > - somehow had less trouble following the flow through registers here, perhaps I am getting used to it. FYI, ended renaming some as: > > // xmm16_27 = Temp1 > // xmm0_3 = Coeffs1 > // xmm4_7 = Coeffs2 > // xmm8_11 = Coeffs3 > // xmm12_15 = Coeffs4 = Temp2 > // xmm16_27 = Scratch For me, it was easier to follow what goes where using the xmm... names (with the symbolic names you always have to remember which one overlaps with another and how much). > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 421: > >> 419: for (int i = 0; i < 8; i += 2) { >> 420: __ evpermi2d(xmm(i / 2 + 12), xmm(i), xmm(i + 1), Assembler::AVX_512bit); >> 421: } > > Wish there was a more 'abstract' way to arrange this, so its obvious from the shape of the code what registers are input/outputs (i.e. and use the register arrays). Even though its just 'elementary index operations' `i/2 + 16` is still 'clever'. Couldnt think of anything myself though (same elsewhere in this function for the table permutes). Well, this is how it is when we have three inputs, one of which also plays as output... At least the output is always the first one (so that one gets clobbered). This is why you have to replicate the permutation describer when you need both permutands later. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 509: > >> 507: // coeffs (int[256]) = c_rarg0 >> 508: // zetas (int[256]) = c_rarg1 >> 509: static address generate_dilithiumAlmostInverseNtt_avx512(StubGenerator *stubgen, > > Done with this function; Perhaps the 'permute table' is a common vector-algorithm pattern, but this is really clever! > > Some general comments first, rest inline. > > - The array names for registers helped a lot. And so did the new helper functions! > - The java version of this code is quite intimidating to vectorize.. 3D loop, with geometric iteration variables.. 
and the literature is even more intimidating (discrete convolutions which I havent touched in two decades, ffts, ntts, etc.) Here is my attempt at a comment to 'un-scare' the next reader, though feel free to reword however you like. > > The core of the (Java) loop is this 'pair-wise' operation: > int a = coeffs[j]; > int b = coeffs[j + offset]; > coeffs[j] = (a + b); > coeffs[j + offset] = montMul(a - b, -MONT_ZETAS_FOR_NTT[m]); > > There are 8 'levels' (0-7); ('levels' are equivalent to (unrolling) the outer (Java) loop) > At each level, the 'pair-wise-offset' doubles (2^l: 1, 2, 4, 8, 16, 32, 64, 128). > > To vectorize this Java code, observe that at each level, REGARDLESS the offset, half the operations are the SUM, and the other half is the > montgomery MULTIPLICATION (of the pair-difference with a constant). At each level, one 'just' has to shuffle > the coefficients, so that SUMs and MULTIPLICATIONs line up accordingly. > > Otherwise, this pattern is 'lightly similar' to a discrete convolution (compute integral/summation of two functions at every offset) > > - I still would prefer (more) symbolic register names.. I wouldn't hold my approval over it so won't object if nobody else does, but register numbers are harder to 'see' through the flow. I ended up search/replacing/'annotating' to make it easier on myself to follow the flow of data: > > // xmm8_11 = Perms1 > // xmm12_15 = Perms2 > // xmm16_27 = Scratch > // xmm0_3 = CoeffsPlus > // xmm4_7 = CoeffsMul > // xmm24_27 = CoeffsMinus (overlaps with Scratch) > > (I made a similar comment, but I think it is now hidden after the last refactor) > - would prefer to see the helper functions to get ALL the registers passed explicitly (i.e. currently `montMulPerm`, `montQInvModR`, `dilithium_q`, `xmm29`, are implicit.). As a general rule, I've tried to set up all the registers up at the 'entry' function (`generate_dilithium*` in this case) and ... I added some more comments, but I kept the xmm... names for the registers, just like with the ntt function. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 554: > >> 552: for (int i = 0; i < 8; i += 2) { >> 553: __ evpermi2d(xmm(i / 2 + 8), xmm(i), xmm(i + 1), Assembler::AVX_512bit); >> 554: __ evpermi2d(xmm(i / 2 + 12), xmm(i), xmm(i + 1), Assembler::AVX_512bit); > > Took a bit to unscramble the flow, so a comment needed? Purpose 'fairly obvious' once I got the general shape of the level/algorithm (as per my top-level comment) but something like "shuffle xmm0-7 into xmm8-15"? I hope the comment that I added at the beginning of the function sheds some light on the purpose of these permutations. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 656: > >> 654: for (int i = 0; i < 8; i++) { >> 655: __ evpsubd(xmm(i), k0, xmm(i + 8), xmm(i), false, Assembler::AVX_512bit); >> 656: } > > Fairly clean as is, but could also be two sub_add calls, I think (you have to swap order of add/sub in the helper, to be able to clobber `xmm(i)`.. or swap register usage downstream, so perhaps not.. but would be cleaner) > > sub_add(CoeffsPlus, Scratch, Perms1, CoeffsPlus, _masm); > sub_add(CoeffsMul, &Scratch[4], Perms2, CoeffsMul, _masm); > > > If nothing else, would had prefered to see the use of the register array variables I would rather leave this alone, too. I was considering the same, but decided that this is fairly easy to follow, it would be more complicated to either add a new helper function or follow where there are overlaps in the symbolically named register sets. 
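To complement the 'pair-wise' description above, here is a scalar Java sketch of what a single such level does. Illustration only: the method names and zeta layout are hypothetical, the reduction step is a plain modular multiplication standing in for the real Montgomery multiplication in ML_DSA.java, and values are assumed small enough that the int additions do not overflow.

```
// Scalar sketch of one "level" of the pair-wise operation: for a given offset,
// half of the updates are sums and the other half are (reduced) products of
// the pair difference with a per-group constant.
public class NttLevelSketch {
    static final int Q = 8380417; // the ML-DSA modulus, for illustration

    static int mulReduce(long a, long b) {
        return (int) Math.floorMod(a * b, Q); // stand-in, NOT Montgomery reduction
    }

    // zetas holds one constant per group of 'offset' pairs (hypothetical layout)
    static void nttLevel(int[] coeffs, int[] zetas, int offset) {
        int m = 0;
        for (int start = 0; start < coeffs.length; start += 2 * offset, m++) {
            for (int j = start; j < start + offset; j++) {
                int a = coeffs[j];
                int b = coeffs[j + offset];
                coeffs[j] = a + b;                               // "sum" half
                coeffs[j + offset] = mulReduce(a - b, zetas[m]); // "multiply" half
            }
        }
    }
}
```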
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 871: > >> 869: __ evpaddd(xmm5, k0, xmm1, barrettAddend, false, Assembler::AVX_512bit); >> 870: __ evpaddd(xmm6, k0, xmm2, barrettAddend, false, Assembler::AVX_512bit); >> 871: __ evpaddd(xmm7, k0, xmm3, barrettAddend, false, Assembler::AVX_512bit); > Fairly 'straightforward' transcription of the Java code.. no comments from me. > > At first glance using `xmm0_3`, `xmm4_7`, etc. might have been a good idea, but you only save one line per 4x group. (Unless you have one big loop, but I suspect that gives you worse performance? Is that something you tried already? Might be worth it otherwise..) I have considered this but decided to leave it alone (for the reason that you mentioned). > src/java.base/share/classes/sun/security/provider/ML_DSA.java line 1418: > >> 1416: int twoGamma2, int multiplier) { >> 1417: assert (input.length == ML_DSA_N) && (lowPart.length == ML_DSA_N)
------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2766414076 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150966 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021151152 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021151361 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021151680 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021152095 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021152962 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021154571 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021156249 From duke at openjdk.org Mon Mar 31 14:28:21 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:21 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Mon, 17 Mar 2025 19:22:41 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Made the intrinsics test separate from the pure java test. > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 45: > >> 43: // Constants >> 44: // >> 45: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Consts[] = { > > This is really nitpicking.. but could had loaded constants inline with `movl` without requiring an ExternalAddress()? > > Nice to have constants together, only complaint is we have 'magic offsets' in ASM to reach in for particular one.. > > This one isnt too bad, offset of 32bits is easy to inspect visually (`dilithiumAvx512ConstsAddr()` could take a parameter perhaps) I added symbolic names for the indexes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021149647 From duke at openjdk.org Mon Mar 31 14:28:22 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:22 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: On Sun, 23 Mar 2025 00:21:18 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 119: >> >>> 117: static address dilithiumAvx512PermsAddr() { >>> 118: return (address) dilithiumAvx512Perms; >>> 119: } >> >> Hear me out.. ... >> enums!! >> >> enum nttPermOffset { >> montMulPermsIdx = 0, >> nttL4PermsIdx = 64, >> nttL5PermsIdx = 192, >> nttL6PermsIdx = 320, >> nttL7PermsIdx = 448, >> nttInvL0PermsIdx = 704, >> nttInvL1PermsIdx = 832, >> nttInvL2PermsIdx = 960, >> nttInvL3PermsIdx = 1088, >> nttInvL4PermsIdx = 1216, >> }; >> static address dilithiumAvx512PermsAddr(nttPermOffset offset) { >> return (address) dilithiumAvx512Perms + offset; >> } > > belay that comment.. now that I looked at `generate_dilithiumAlmostInverseNtt_avx512`, I see why thats not the 'entire picture'.. I leave it as it is now. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021149925 From duke at openjdk.org Mon Mar 31 14:28:24 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:24 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v10] In-Reply-To: <2yP2P1VNWgQu6cWvn0_a_7LdidS71C6PWKcqGKTOHnc=.49f8ac0f-df23-4f1e-adb9-e03a3f2295b2@github.com> References: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> <2yP2P1VNWgQu6cWvn0_a_7LdidS71C6PWKcqGKTOHnc=.49f8ac0f-df23-4f1e-adb9-e03a3f2295b2@github.com> Message-ID: On Sat, 22 Mar 2025 16:36:08 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix windows build > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 121: > >> 119: static void montmulEven(int outputReg, int inputReg1, int inputReg2, >> 120: int scratchReg1, int scratchReg2, >> 121: int parCnt, MacroAssembler *_masm) { > > nitpick.. this could be made to look more like `montMul64()` by also taking in an array of registers. I eliminated this function. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 160: > >> 158: for (int i = 0; i < 4; i++) { >> 159: __ vpmuldq(xmm(scratchRegs[i]), xmm(inputRegs1[i]), xmm(inputRegs2[i]), >> 160: Assembler::AVX_512bit); > > using an array of registers, instead of array of ints would read somewhat more compact and fewer 'indirections' . i.e. > > static void montMul64(XMMRegister outputRegs*, XMMRegister inputRegs1*, XMMRegister inputRegs2*, > ... > __ vpmuldq(scratchRegs[i], inputRegs1[i], inputRegs2[i], Assembler::AVX_512bit); I think from the names it is easy enough to see that we are really passing register names here and it is also easy to check that the indexes of the registers in the named arrays are really what the names of those arrays suggest, so I would like to leave this alone. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 645: > >> 643: // poly1 (int[256]) = c_rarg1 >> 644: // poly2 (int[256]) = c_rarg2 >> 645: static address generate_dilithiumNttMult_avx512(StubGenerator *stubgen, > > This would be 'nice to have', something 'lost' with the refactor.. > > As I was reviewing this (original) function, I was thinking, "there is nothing here _that_ specific to AVX512, mostly columnar&independent operations... This function could be made 'vector-length-independent'..." > - double the loop length: > > int iter = vector_len==Assembler::AVX_512bit?4:8; > __ movl(len, 4); -> __ movl(len, iter); > > - halve the register arrays.. (or keep them the same but shuffle them to make SURE the first half are in xmm0-xmm15 range) > > XMMRegister POLY1[] = {xmm0, xmm1, xmm12, xmm13}; > XMMRegister POLY2[] = {xmm4, xmm5, xmm16, xmm17}; > XMMRegister SCRATCH1[] = {xmm2, xmm3, xmm14, xmm15}; <<< here > XMMRegister SCRATCH2[] = {xmm6, xmm7, xmm18, xmm19}; <<< and here > XMMRegister SCRATCH3[] = {xmm8, xmm9, xmm10, xmm11}; > > - couple of other int constants (like the memory 'step' and such) > - for assembler calls, like `evmovdqul` and `evpsubd`, need a few small new MacroAssembler helpers to instead generate VEX encoded versions (plenty of instructions already do that). > - I think only the perm instruction was unique to evex (didnt really think of an alternative for AVX2.. but can be abstracted away with another helper) > > Anyway; not suggesting its something you do here.. 
but it would be convenient to leave breadcrumbs/hooks for a future update so one of us can revisit this code and add AVX2 support. e.g. `parCnt` variable was very convenient before for exactly this, now its gone... it probably could be derived in each function from vector_len but..; Its now cleaner, but also harder to 'upgrade'? > > Why AVX2? many of the newer (Atom/Ecore-based/EnableX86ECoreOpts) processors do not have AVX512 support, so its something I've been prioritizing recently > > The alternative would be to write a completely separate AVX2 implementation, but that would be a shame, not to 'just' reuse this code. > ? > "For fun", I had even gone and parametrized the mult function with the `vector_len` to see how it would look (almost identical... to the original version): > > static void montmulEven2(XMMRegister* outputReg, XMMRegister* inputReg1, XMMRegister* inputReg2, XMMRegister* scratchReg1, > XMMRegister* scratchReg2, XMMRegister montQInvModR, XMMRegister dilithium_q, int parCnt, int vector_len, ... I'd like to leave this for another PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150150 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150516 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021153931 From duke at openjdk.org Mon Mar 31 14:28:25 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:25 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v10] In-Reply-To: <36fyT0z29o9GYLeQhpYkIT4d2By-8z7TEU8TGtT2uHI=.50647fa4-32ca-41ef-8287-075a70254143@github.com> References: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> <2yP2P1VNWgQu6cWvn0_a_7LdidS71C6PWKcqGKTOHnc=.49f8ac0f-df23-4f1e-adb9-e03a3f2295b2@github.com> <36fyT0z29o9GYLeQhpYkIT4d2By-8z7TEU8TGtT2uHI=.50647fa4-32ca-41ef-8287-075a70254143@github.com> Message-ID: On Sun, 23 Mar 2025 00:26:20 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 216: >> >>> 214: // Zmm8-Zmm23 used as scratch registers >>> 215: // result goes to Zmm0-Zmm7 >>> 216: static void montMulByConst128(MacroAssembler *_masm) { >> >> wish the inputs and output register arrays were explicit.. easier to follow that way > > Looking at this function some more.. I think you could remove this function and replace it with two calls to `montMul64`? > > montMul64(xmm0_3, xmm0_3, xmm29_29, Scratch*, _masm); > montMul64(xmm4_7, xmm4_7, xmm29_29, Scratch*, _masm); > ``` > Scratch would have to be defined.. I accepted this suggestion, it really saved quite a few lines of code, thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150687 From duke at openjdk.org Mon Mar 31 14:28:26 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:26 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 16:11:02 GMT, Volodymyr Paprotski wrote: >> These functions will not be used anywhere else and in ML_DSA.java all of the arrays passed to inrinsics are of the correct size. > > Works for me; just thought I would point it out, so its a 'premeditated' decision. Well, I ended up putting some asserts in the java code, just in case... 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021153417 From duke at openjdk.org Mon Mar 31 14:28:27 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:27 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Thu, 6 Mar 2025 19:26:14 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Accepted review comments. > > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 409: > >> 407: __ evmovdquq(xmm29, Address(permsAndRots, 768), Assembler::AVX_512bit); >> 408: __ evmovdquq(xmm30, Address(permsAndRots, 832), Assembler::AVX_512bit); >> 409: __ evmovdquq(xmm31, Address(permsAndRots, 896), Assembler::AVX_512bit); > > Matter of taste, but I liked the compactness of montmulEven; i.e. > > for (i=0; i<15; i++) > __ evmovdquq(xmm(17+i), Address(permsAndRots, 64*i), Assembler::AVX_512bit); Changed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021155416 From duke at openjdk.org Mon Mar 31 14:40:56 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:40:56 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v12] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Reacting to comments by Volodymyr. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/56656894..7a9f6645 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=10-11 Stats: 145 lines in 2 files changed: 24 ins; 91 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From jbhateja at openjdk.org Mon Mar 31 16:43:39 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 31 Mar 2025 16:43:39 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: <-sFpKarpt9CP7DYd7v9vSBAgHYthQ4OZFNGHFOgb2AI=.fc908719-8e45-43d2-97df-95ff01129275@github.com> On Mon, 31 Mar 2025 11:11:54 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/x86/vm_version_x86.cpp line 1252: >> >>> 1250: // Currently we only have them for AVX512 >>> 1251: #ifdef _LP64 >>> 1252: if (supports_evex() && supports_avx512bw()) { >> >> supports_evex check looks redundant. > > These are checks for two different feature bits: CPU_AVX512F and CPU_AVX512BW. Are you saying that the latter implies the former in every implementation of the spec? AVX512BW is built on top of AVX512F spec. In assembler and other places we only check BW in assertions which implies EVEX. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021381288