From tschatzl at openjdk.org Mon Mar 3 08:42:05 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 08:42:05 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * fix comment (trailing whitespace) * another assert when snapshotting at a safepoint. 
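For readers skimming the thread, a rough idea of what the reduction amounts to is sketched below. This is a standalone illustration only, not the code in this change: the helper names, the card/region constants and the exact set of filters that survive are assumptions based on the description above.

```c++
// Hypothetical sketch (not the PR's code): a post-write barrier that only
// filters and conditionally dirties a card -- no StoreLoad fence and no
// dirty card queue enqueue remain on the fast path.
#include <cstdint>

static const int     kCardShift   = 9;    // 512-byte cards, illustrative
static const int     kRegionShift = 22;   // 4M regions, illustrative
static const uint8_t kCleanCard   = 0xff; // illustrative card values
static const uint8_t kDirtyCard   = 0x00;

static uint8_t* card_table_base = nullptr; // assumed to be set up elsewhere in this model

static inline void post_write_barrier(void* field_addr, void* new_val) {
  uintptr_t f = reinterpret_cast<uintptr_t>(field_addr);
  uintptr_t v = reinterpret_cast<uintptr_t>(new_val);
  if (((f ^ v) >> kRegionShift) == 0) return;  // same region check
  if (new_val == nullptr) return;              // null value check
  uint8_t* card = card_table_base + (f >> kCardShift);
  if (*card != kCleanCard) return;             // conditional card mark
  *card = kDirtyCard;                          // single store; no fence, no enqueue
}
```

Compared with the pseudo code above, everything from the StoreLoad onwards collapses into at most one conditional store, which is roughly what brings the inlined fast path into the neighbourhood of the three-or-four instruction barriers of Serial and Parallel GC.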
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/d87935a0..810bf2d3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=03-04 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Mon Mar 3 11:18:32 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Mar 2025 11:18:32 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA Message-ID: By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. ------------- Commit messages: - JDK-8351034 Add AVX-512 intrinsics for ML-DSA Changes: https://git.openjdk.org/jdk/pull/23860/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351034 Stats: 2530 lines in 18 files changed: 2445 ins; 9 del; 76 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From amitkumar at openjdk.org Mon Mar 3 14:25:54 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 3 Mar 2025 14:25:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix comment (trailing whitespace) > * another assert when snapshotting at a safepoint. I don't see any failure on s390x. Tier1 test looks good. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2694563382 From ayang at openjdk.org Mon Mar 3 15:22:10 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Mon, 3 Mar 2025 15:22:10 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix comment (trailing whitespace) > * another assert when snapshotting at a safepoint. src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 106: > 104: > 105: __ testptr(count, count); > 106: __ jcc(Assembler::equal, done); I wonder if we can use "zero" instead of "equal" here; they have the same underlying value, but the semantic is to checking for "zero". src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 133: > 131: Label is_clean_card; > 132: __ cmpb(Address(addr, 0), G1CardTable::clean_card_val()); > 133: __ jcc(Assembler::equal, is_clean_card); Should this checking be guarded by `if (UseCondCardMark)`? I see that aarch64 does that. src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 143: > 141: > 142: __ bind(is_clean_card); > 143: // Card was not clean. Dirty card and go to next.. Why "not clean"? I thought this path is for dirtying clean card? src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 323: > 321: assert(thread == r15_thread, "must be"); > 322: #endif // _LP64 > 323: assert_different_registers(store_addr, new_val, thread, tmp1 /*, tmp2 unused */, noreg); Seems that `tmp2` is unused in this method. It is used in aarch64, but it's not obvious to me whether that is indeed necessary. If so, can you add a comment saying sth like "this unused var is needed for other archs..."? src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 54: > 52: // result = 0xBBAABBAA > 53: inline size_t blend(size_t a, size_t b, size_t mask) { > 54: return a ^ ((a ^ b) & mask); The example makes it much clearer; I wonder if `return (a & ~mask) | (b & mask);` is more readable. src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 59: > 57: > 58: void G1CardTableClaimTable::reset_all_claims_to_claimed() { > 59: for (size_t i = 0; i < _max_reserved_regions; i++) { `uint` for `i`? src/hotspot/share/gc/g1/g1CardTableClaimTable.hpp line 64: > 62: void reset_all_claims_to_unclaimed(); > 63: void reset_all_claims_to_claimed(); > 64: I wonder if these two APIs can be renamed to "reset_all_to_x", which is more aligned with its single-region counterpart, `reset_to_unclaimed`, IMO. 
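As a side note on the `blend` helper a few comments up: the two candidate formulations are bit-for-bit equivalent, which a small standalone check (plain C++, not part of the patch) makes easy to verify. The concrete inputs below are an assumption chosen to reproduce the `0xBBAABBAA` result shown in the snipped source comment.

```c++
// Quick standalone equivalence check for the two blend() forms discussed above;
// not part of the patch. Inputs are assumed/illustrative.
#include <cassert>
#include <cstddef>
#include <cstdio>

// Take bits from 'b' where 'mask' is set, from 'a' where it is clear.
static size_t blend_xor(size_t a, size_t b, size_t mask) {
  return a ^ ((a ^ b) & mask);            // form used in the patch
}

static size_t blend_and_or(size_t a, size_t b, size_t mask) {
  return (a & ~mask) | (b & mask);        // suggested, arguably more readable form
}

int main() {
  assert(blend_xor(0xAAAAAAAA, 0xBBBBBBBB, 0xFF00FF00) == 0xBBAABBAA);
  // Exhaustive check over byte-sized operands as a sanity test of equivalence.
  for (size_t a = 0; a < 256; a++)
    for (size_t b = 0; b < 256; b++)
      for (size_t m = 0; m < 256; m++)
        assert(blend_xor(a, b, m) == blend_and_or(a, b, m));
  puts("both blend() forms agree");
  return 0;
}
```

Whether a given compiler emits the same instructions for both is then purely a readability question.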
src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 348: > 346: void G1ConcurrentRefineWorkState::snapshot_heap_into(G1CardTableClaimTable* sweep_table) { > 347: // G1CollectedHeap::heap_region_iterate() below will only visit committed regions. Initialize > 348: // all entries in the state table here to not require special handling when iterating over it. Can you elaborate on what the "special handling" would be, if we don's set "claimed" for non-committed regions? src/hotspot/share/gc/g1/g1RemSet.cpp line 837: > 835: for (; refinement_cur_card < refinement_end_card; ++refinement_cur_card, ++card_cur_word) { > 836: size_t value = *refinement_cur_card; > 837: *refinement_cur_card = G1CardTable::WordAllClean; Similarly, this is a "word", not "card", also. src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 857: > 855: // We do not expect too many non-Java threads compared to Java threads, so just > 856: // let one worker claim that work. > 857: if (!_non_java_threads_claim && !Atomic::cmpxchg(&_non_java_threads_claim, false, true, memory_order_relaxed)) { Do non-java threads have card-table-base? src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 862: > 860: > 861: class ResizeAndSwapCardTableClosure : public ThreadClosure { > 862: SwapCardTableClosure _cl; Field indentation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977586579 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977594184 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977583002 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977601907 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977645576 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977571306 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977573354 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977704351 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977575441 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977701293 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977679688 From dnsimon at openjdk.org Mon Mar 3 15:32:18 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 3 Mar 2025 15:32:18 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields Message-ID: The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. 
------------- Commit messages: - made order of ciInstanceKlass::_nonstatic_fields same as JavaFieldStream (and Class.getDeclaredFields) - made order of ResolvedJavaType.getInstanceFields match Class.getDeclaredFields Changes: https://git.openjdk.org/jdk/pull/23849/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23849&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350892 Stats: 89 lines in 6 files changed: 18 ins; 32 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/23849.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23849/head:pull/23849 PR: https://git.openjdk.org/jdk/pull/23849 From tschatzl at openjdk.org Mon Mar 3 15:40:04 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 15:40:04 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 14:11:09 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 143: > >> 141: >> 142: __ bind(is_clean_card); >> 143: // Card was not clean. Dirty card and go to next.. > > Why "not clean"? I thought this path is for dirtying clean card? My interpretation is: in this path the card has been found clean ("is clean") earlier. So dirty it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977733993 From tschatzl at openjdk.org Mon Mar 3 15:42:57 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 15:42:57 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 14:47:00 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 54: > >> 52: // result = 0xBBAABBAA >> 53: inline size_t blend(size_t a, size_t b, size_t mask) { >> 54: return a ^ ((a ^ b) & mask); > > The example makes it much clearer; I wonder if `return (a & ~mask) | (b & mask);` is more readable. ... and hope that the optimizer knows this pattern? If you insist I can do that, brief examination of that code snippet by itself (not within this code) showed that it does. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977739888 From mdoerr at openjdk.org Mon Mar 3 16:31:58 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 3 Mar 2025 16:31:58 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> Message-ID: <413JPgs-IIREKFfH05GHeskZzg5lpyBuNbW6jeGyQVk=.35277f99-0552-4e06-92a0-17d051979e1a@github.com> On Fri, 28 Feb 2025 10:47:39 GMT, Martin Doerr wrote: > > I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, But it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. > > The PPC64 code looks correct and some quick tests have passed. I'll run larger test suites over the weekend. 
Test results look good (including tier 1-4 on many platforms). I didn't see any new issue related to this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2694935731 From tschatzl at openjdk.org Mon Mar 3 16:55:55 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 16:55:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 15:17:27 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/share/gc/g1/g1YoungGCPostEvacuateTasks.cpp line 857: > >> 855: // We do not expect too many non-Java threads compared to Java threads, so just >> 856: // let one worker claim that work. >> 857: if (!_non_java_threads_claim && !Atomic::cmpxchg(&_non_java_threads_claim, false, true, memory_order_relaxed)) { > > Do non-java threads have card-table-base? This code should not be necessary (any more). Will remove. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977853483 From tschatzl at openjdk.org Mon Mar 3 18:22:24 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 3 Mar 2025 18:22:24 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v6] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. 
> > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: ayang review 2 * removal of useless code * renamings ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/810bf2d3..b3dd0084 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=04-05 Stats: 51 lines in 7 files changed: 16 ins; 10 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Mon Mar 3 19:00:59 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Mar 2025 19:00:59 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v2] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Added comments, removed debugging printfs ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/1ff58512..fe50e0d8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=00-01 Stats: 12 lines in 2 files changed: 9 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From iwalulya at openjdk.org Mon Mar 3 20:18:58 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Mon, 3 Mar 2025 20:18:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 08:42:05 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. 
Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix comment (trailing whitespace) > * another assert when snapshotting at a safepoint. src/hotspot/share/gc/g1/g1CardTable.cpp line 44: > 42: if (!failures) { > 43: G1CollectedHeap* g1h = G1CollectedHeap::heap(); > 44: G1HeapRegion* r = g1h->heap_region_containing(mr.start()); Probably we can move this outside the loop, and assert that `mr` does not cross region boundaries src/hotspot/share/gc/g1/g1CollectedHeap.hpp line 916: > 914: void safepoint_synchronize_end() override; > 915: > 916: jlong synchronized_duration() const { return _safepoint_duration; } safepoint_duration() seems easier to comprehend. src/hotspot/share/gc/g1/g1CollectionSet.cpp line 310: > 308: verify_young_cset_indices(); > 309: > 310: size_t card_rs_length = _policy->analytics()->predict_card_rs_length(in_young_only_phase); Why are we using a prediction here? Additionally, won't this prediction also include cards from the old gen regions in case of mixed gcs? How do we reconcile that when we are adding old gen regions to c-set? src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 42: > 40: class G1HeapRegion; > 41: class G1Policy; > 42: class G1CardTableClaimTable; Nit: ordering of the declarations src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 84: > 82: // Tracks the current refinement state from idle to completion (and reset back > 83: // to idle). > 84: class G1ConcurrentRefineWorkState { G1ConcurrentRefinementState? I am not convinced the "Work" adds any clarity src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 113: > 111: // Current epoch the work has been started; used to determine if there has been > 112: // a forced card table swap due to a garbage collection while doing work. 
> 113: size_t _refine_work_epoch; same as previous comment, why `refine_work` instead of `refinement`? src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 43: > 41: size_t _cards_clean; // Number of cards found clean. > 42: size_t _cards_not_parsable; // Number of cards we could not parse and left unrefined. > 43: size_t _cards_still_refer_to_cset; // Number of cards marked still young. `_cards_still_refer_to_cset` from the naming it is not clear what the difference is with `_cards_refer_to_cset`, the comment is not helping with that ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977688778 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977969470 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977982999 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1977991124 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978017843 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978019093 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978119476 From pchilanomate at openjdk.org Mon Mar 3 23:42:56 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Mon, 3 Mar 2025 23:42:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. 
While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. Changes look good to me. Just a few comments. Thanks, Patricio src/hotspot/share/runtime/objectMonitor.cpp line 204: > 202: // If the thread (F) that removes itself from the end of the list > 203: // hasn't got any prev pointer, we just set the tail pointer to > 204: // null, see 5) and 6) below. Setting the tail pointer to null would be for the case when this node is also the head, i.e single element. Otherwise we just rebuild the doubly link list, unlink F, and set entry_list_tail to G. In other words, the comment here and below seems to be missing that we have to build the doubly link list when F acquires the monitor, not when F needs to find a successor. src/hotspot/share/runtime/objectMonitor.cpp line 1265: > 1263: // that updated _entry_list, so we can access w->_next. > 1264: w = Atomic::load_acquire(&_entry_list); > 1265: assert(w != nullptr, "invariant"); Maybe add the same assert as below for the single element case: `assert(w->TState == ObjectWaiter::TS_ENTER, "invariant")`. src/hotspot/share/runtime/objectMonitor.cpp line 1359: > 1357: // Build the doubly linked list to get hold of currentNode->prev(). > 1358: _entry_list_tail = nullptr; > 1359: entry_list_tail(current); I think we should try to avoid having to rebuild the doubly link list from scratch, since only a few nodes in the front might be missing the previous links. For platform threads it might not matter that much, but for virtual threads this list could be much larger. Maybe we can leave it as a future enhancement. src/hotspot/share/runtime/objectMonitor.cpp line 1509: > 1507: // is no successor, so it appears that an heir-presumptive > 1508: // (successor) must be made ready. Only the current lock owner can > 1509: // detach threads from the entry_list, therefore we need to We don't detach threads here, so maybe manipulate would be better. src/hotspot/share/runtime/objectMonitor.cpp line 1532: > 1530: // Let's say T1 then stalls. T2 acquires O and calls O.notify(). The > 1531: // notify() operation moves T1 from O's waitset to O's entry_list. T2 then > 1532: // release the lock "O". T2 resumes immediately after the ST of null into Pre-existent, but this should be T1. Same in next sentence. 
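As a reading aid for the list manipulation under review, here is a simplified standalone model of the push-to-head / lazily-built-prev scheme described in the PR text. It is an illustration of the idea only, not the actual ObjectMonitor code: it uses std::atomic instead of HotSpot's Atomic:: wrappers, a plain struct instead of ObjectWaiter, and it omits waiter states, removal/unlinking and all the races the real code handles.

```c++
// Simplified standalone model of the entry_list scheme; not ObjectMonitor code.
#include <atomic>

struct Waiter {
  Waiter* next = nullptr;   // set once when pushed (singly-linked via next)
  Waiter* prev = nullptr;   // filled in lazily while walking towards the tail
};

static std::atomic<Waiter*> entry_list{nullptr};
static Waiter*              entry_list_tail = nullptr;  // only touched by the lock owner

// Contending threads push themselves onto the head with a CAS loop.
static void push(Waiter* w) {
  Waiter* head = entry_list.load(std::memory_order_relaxed);
  do {
    w->next = head;
    w->prev = nullptr;
  } while (!entry_list.compare_exchange_weak(head, w,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
}

// The exiting owner picks the successor in FIFO order, i.e. the tail. Walking
// from the head fills in the prev pointers (building the doubly-linked part),
// and the tail is cached so later exits do not have to walk again.
static Waiter* find_successor() {
  if (entry_list_tail != nullptr) return entry_list_tail;
  Waiter* w = entry_list.load(std::memory_order_acquire);
  if (w == nullptr) return nullptr;            // nobody is waiting
  while (w->next != nullptr) {
    w->next->prev = w;                         // interior of the list is stable
    w = w->next;
  }
  entry_list_tail = w;
  return w;
}
```

The model deliberately leaves out the parts being discussed in the comments above (what happens when a thread removes itself from the tail, and when the prev pointers get rebuilt), so it is only meant as a map of the terminology.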
------------- PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2655551088 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978372164 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978368315 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978374081 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978369547 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978370888 From dholmes at openjdk.org Tue Mar 4 04:52:57 2025 From: dholmes at openjdk.org (David Holmes) Date: Tue, 4 Mar 2025 04:52:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Mon, 3 Mar 2025 23:15:46 GMT, Patricio Chilano Mateo wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 204: > >> 202: // If the thread (F) that removes itself from the end of the list >> 203: // hasn't got any prev pointer, we just set the tail pointer to >> 204: // null, see 5) and 6) below. > > Setting the tail pointer to null would be for the case when this node is also the head, i.e single element. Otherwise we just rebuild the doubly link list, unlink F, and set entry_list_tail to G. In other words, the comment here and below seems to be missing that we have to build the doubly link list when F acquires the monitor, not when F needs to find a successor. We don't rebuild at this point. The thread that is removing itself just sets tail to null if there is no prev. Later when F exits the monitor it will construct the DLL to find the next successor. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1978608595 From tschatzl at openjdk.org Tue Mar 4 08:24:54 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:24:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 20:02:16 GMT, Ivan Walulya wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 43: > >> 41: size_t _cards_clean; // Number of cards found clean. >> 42: size_t _cards_not_parsable; // Number of cards we could not parse and left unrefined. >> 43: size_t _cards_still_refer_to_cset; // Number of cards marked still young. > > `_cards_still_refer_to_cset` from the naming it is not clear what the difference is with `_cards_refer_to_cset`, the comment is not helping with that `cards_still_refer_to_cset` refers to cards that were found to have already been marked as `to-collection-set`. Renamed to `_cards_already_refer_to_cset`, would that be okay? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978868225 From tschatzl at openjdk.org Tue Mar 4 08:28:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:28:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 18:28:48 GMT, Ivan Walulya wrote: > Why are we using a prediction here? Quickly checking again, do we have the actual count here from somewhere? > Additionally, won't this prediction also include cards from the old gen regions in case of mixed gcs? How do we reconcile that when we are adding old gen regions to c-set? The predictor contents changed to (supposedly) only contain cards containing young gen references. See g1Policy.cpp:934: _analytics->report_card_rs_length(total_cards_scanned - total_non_young_rs_cards, is_young_only_pause); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978876199 From tschatzl at openjdk.org Tue Mar 4 08:36:55 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:36:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 15:19:20 GMT, Albert Mingkun Yang wrote: > Can you elaborate on what the "special handling" would be, if we don's set "claimed" for non-committed regions? the iteration code, would for every region check whether the region is actually committed or not. The `heap_region_iterate()` API of `G1CollectedHeap` only iterates over committed regions. So only committed regions will be updated in the state table. Later when iterating over the state table, the code uses the array directly, i.e. the claim state of uncommitted regions would be read as uninitialized. Further, it would be hard to exclude regions committed after the snapshot otherwise (we do not need to iterate over them. Their card table can't contain card marks) as we do not track newly committed regions in the snapshot. We could do, but would be a headache due to memory synchronization because regions can be committed any time. Imho it is much simpler to reset all the card claims to "already processed" and then make the regions we want to work on claimable. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978893134 From tschatzl at openjdk.org Tue Mar 4 08:39:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:39:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 08:22:03 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 43: >> >>> 41: size_t _cards_clean; // Number of cards found clean. >>> 42: size_t _cards_not_parsable; // Number of cards we could not parse and left unrefined. >>> 43: size_t _cards_still_refer_to_cset; // Number of cards marked still young. >> >> `_cards_still_refer_to_cset` from the naming it is not clear what the difference is with `_cards_refer_to_cset`, the comment is not helping with that > > `cards_still_refer_to_cset` refers to cards that were found to have already been marked as `to-collection-set`. Renamed to `_cards_already_refer_to_cset`, would that be okay? Fwiw, this is just for statistics, so if you want I can remove these. 
I did some experiments with re-examining these cards too to see whether we could clear them later. For determining if/when to do that a rate of increase for the young cards has been interesting. As mentioned, if you want I can remove them. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1978896272 From tschatzl at openjdk.org Tue Mar 4 08:53:46 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 08:53:46 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v7] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... 
Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * iwalulya initial comments * renaming * made blend() helper function more clear; at least gcc will optimize it to the same code as before ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/b3dd0084..8f46dc9a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=05-06 Stats: 27 lines in 9 files changed: 7 ins; 3 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Tue Mar 4 09:15:24 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 09:15:24 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v8] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. 
Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * do not change card table base for gc threads during swapping * not necessary because they do not use it * (recent assert that verifies that non-java threads do not have a card table found this) ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/8f46dc9a..9e2ee543 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=06-07 Stats: 25 lines in 1 file changed: 9 ins; 14 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From dnsimon at openjdk.org Tue Mar 4 09:23:08 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Mar 2025 09:23:08 GMT Subject: RFR: 8351036: [JVMCI] value not an s2: -32776 Message-ID: This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. ------------- Commit messages: - support stack slots with an offset > Short.MAX_VALUE Changes: https://git.openjdk.org/jdk/pull/23888/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23888&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351036 Stats: 44 lines in 4 files changed: 36 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23888.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23888/head:pull/23888 PR: https://git.openjdk.org/jdk/pull/23888 From yzheng at openjdk.org Tue Mar 4 09:30:53 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 4 Mar 2025 09:30:53 GMT Subject: RFR: 8351036: [JVMCI] value not an s2: -32776 In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 09:18:40 GMT, Doug Simon wrote: > This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. Marked as reviewed by yzheng (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23888#pullrequestreview-2656634837 From iwalulya at openjdk.org Tue Mar 4 09:38:58 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Tue, 4 Mar 2025 09:38:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 08:36:58 GMT, Thomas Schatzl wrote: >> `cards_still_refer_to_cset` refers to cards that were found to have already been marked as `to-collection-set`. Renamed to `_cards_already_refer_to_cset`, would that be okay? > > Fwiw, this particular counter is just for statistics, so if you want I can remove these. I did some experiments with re-examining these cards too to see whether we could clear them later. For determining if/when to do that a rate of increase for the young cards has been interesting. > > As mentioned, if you want I can remove them. 
`_cards_already_refer_to_cset` is fine by me, i don't like the option of removing them ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979009507 From iwalulya at openjdk.org Tue Mar 4 09:43:54 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Tue, 4 Mar 2025 09:43:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 08:26:10 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1CollectionSet.cpp line 310: >> >>> 308: verify_young_cset_indices(); >>> 309: >>> 310: size_t card_rs_length = _policy->analytics()->predict_card_rs_length(in_young_only_phase); >> >> Why are we using a prediction here? Additionally, won't this prediction also include cards from the old gen regions in case of mixed gcs? How do we reconcile that when we are adding old gen regions to c-set? > >> Why are we using a prediction here? > > Quickly checking again, do we have the actual count here from somewhere? > >> Additionally, won't this prediction also include cards from the old gen regions in case of mixed gcs? How do we reconcile that when we are adding old gen regions to c-set? > > The predictor contents changed to (supposedly) only contain cards containing young gen references. See g1Policy.cpp:934: > > _analytics->report_card_rs_length(total_cards_scanned - total_non_young_rs_cards, is_young_only_pause); Fair, I missed that details on young RS have been removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979022900 From tschatzl at openjdk.org Tue Mar 4 09:57:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 09:57:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * iwalulya review 2 * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState * some additional documentation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/9e2ee543..442d9eae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=07-08 Stats: 93 lines in 7 files changed: 27 ins; 3 del; 63 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Tue Mar 4 09:57:58 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 09:57:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: <3BAl6ELdTMEhWoovthkw7lq86mwuoUnyKxzCANFnwNc=.41077bf4-8073-4810-9d0d-078d7ad06240@github.com> On Tue, 4 Mar 2025 09:52:40 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 84: >> >>> 82: // Tracks the current refinement state from idle to completion (and reset back >>> 83: // to idle). >>> 84: class G1ConcurrentRefineWorkState { >> >> G1ConcurrentRefinementState? I am not convinced the "Work" adds any clarity > > We agreed on `G1ConcurrentRefineSweepState` for now, better suggestions welcome. > > Use `Refine` instead of `Refinement` since all pre-existing classes also use `Refine`. This could be renamed in an extra change. Add the `Sweep` in the name because this is not the state for entire refinement (which also includes information about when to start refinement/sweeping). 
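To make the sweep terminology a little more concrete, here is one way such a per-cycle state object could be modelled. Everything below is an assumption made for illustration; the enum values, method names and epoch handling are not the ones in the patch.

```c++
// Illustrative sketch only; not the patch's G1ConcurrentRefineSweepState.
#include <atomic>
#include <cstddef>

enum class SweepState {
  Idle,           // mutators dirty the primary card table, nothing to sweep
  SwappedTables,  // card tables switched, heap snapshot taken
  Sweeping,       // refinement threads walk the now-inactive table
  Completed       // swept table cleaned, ready to go back to Idle
};

class RefineSweepCycle {
  SweepState          _state = SweepState::Idle;  // only the control logic changes this
  std::atomic<size_t> _swap_epoch{0};             // bumped on every swap, incl. forced ones

public:
  size_t start_sweep() {
    _state = SweepState::SwappedTables;
    size_t epoch = _swap_epoch.fetch_add(1, std::memory_order_acq_rel) + 1;
    _state = SweepState::Sweeping;
    return epoch;    // workers remember which snapshot they are sweeping
  }

  // A garbage collection that has to switch tables itself bumps the epoch.
  void forced_swap_at_gc() {
    _swap_epoch.fetch_add(1, std::memory_order_acq_rel);
    _state = SweepState::Idle;
  }

  // If another swap happened in the meantime, the snapshot being swept is
  // stale and the worker should stop instead of finishing it.
  bool is_interrupted(size_t my_epoch) const {
    return _swap_epoch.load(std::memory_order_acquire) != my_epoch;
  }

  void complete_sweep() { _state = SweepState::Completed; }
  void make_idle()      { _state = SweepState::Idle; }
};
```

The epoch check mirrors what the quoted `_refine_work_epoch` comment describes: detecting a card table swap forced by a garbage collection while sweep work is still in flight.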
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979053344 From tschatzl at openjdk.org Tue Mar 4 09:57:58 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 09:57:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v5] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 18:50:37 GMT, Ivan Walulya wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix comment (trailing whitespace) >> * another assert when snapshotting at a safepoint. > > src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 84: > >> 82: // Tracks the current refinement state from idle to completion (and reset back >> 83: // to idle). >> 84: class G1ConcurrentRefineWorkState { > > G1ConcurrentRefinementState? I am not convinced the "Work" adds any clarity We agreed on `G1ConcurrentRefineSweepState` for now, better suggestions welcome. Use `Refine` instead of `Refinement` since all pre-existing classes also use `Refine`. This could be renamed in an extra change. > src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 113: > >> 111: // Current epoch the work has been started; used to determine if there has been >> 112: // a forced card table swap due to a garbage collection while doing work. >> 113: size_t _refine_work_epoch; > > same as previous comment, why `refine_work` instead of `refinement`? Already renamed, same as previous comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979050867 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979051649 From mdoerr at openjdk.org Tue Mar 4 10:40:55 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 4 Mar 2025 10:40:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 09:57:56 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * iwalulya review 2 > * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState > * some additional documentation I got an error while testing java/foreign/TestUpcallStress.java on linuxaarch64 with this PR: # Internal Error (/openjdk-jdk-linux_aarch64-dbg/jdk/src/hotspot/share/gc/g1/g1CardTable.cpp:56), pid=19044, tid=19159 # guarantee(!failures) failed: there should not have been any failures ... V [libjvm.so+0xb6e988] G1CardTable::verify_region(MemRegion, unsigned char, bool)+0x3b8 (g1CardTable.cpp:56) V [libjvm.so+0xc3a10c] G1MergeHeapRootsTask::G1ClearBitmapClosure::do_heap_region(G1HeapRegion*)+0x13c (g1RemSet.cpp:1048) V [libjvm.so+0xb7a80c] G1CollectedHeap::par_iterate_regions_array(G1HeapRegionClosure*, G1HeapRegionClaimer*, unsigned int const*, unsigned long, unsigned int) const+0x9c (g1CollectedHeap.cpp:2059) V [libjvm.so+0xc49fe8] G1MergeHeapRootsTask::work(unsigned int)+0x708 (g1RemSet.cpp:1225) V [libjvm.so+0x19597bc] WorkerThread::run()+0x98 (workerThread.cpp:69) V [libjvm.so+0x1824510] Thread::call_run()+0xac (thread.cpp:231) V [libjvm.so+0x13b3994] thread_native_entry(Thread*)+0x130 (os_linux.cpp:877) C [libpthread.so.0+0x875c] start_thread+0x18c ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2697024679 From tschatzl at openjdk.org Tue Mar 4 10:48:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 10:48:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 10:37:47 GMT, Martin Doerr wrote: > I got an error while testing java/foreign/TestUpcallStress.java on linuxaarch64 with this PR: > > ``` > # Internal Error (/openjdk-jdk-linux_aarch64-dbg/jdk/src/hotspot/share/gc/g1/g1CardTable.cpp:56), pid=19044, tid=19159 > # guarantee(!failures) failed: there should not have been any failures > ... 
> V [libjvm.so+0xb6e988] G1CardTable::verify_region(MemRegion, unsigned char, bool)+0x3b8 (g1CardTable.cpp:56) > V [libjvm.so+0xc3a10c] G1MergeHeapRootsTask::G1ClearBitmapClosure::do_heap_region(G1HeapRegion*)+0x13c (g1RemSet.cpp:1048) > V [libjvm.so+0xb7a80c] G1CollectedHeap::par_iterate_regions_array(G1HeapRegionClosure*, G1HeapRegionClaimer*, unsigned int const*, unsigned long, unsigned int) const+0x9c (g1CollectedHeap.cpp:2059) > V [libjvm.so+0xc49fe8] G1MergeHeapRootsTask::work(unsigned int)+0x708 (g1RemSet.cpp:1225) > V [libjvm.so+0x19597bc] WorkerThread::run()+0x98 (workerThread.cpp:69) > V [libjvm.so+0x1824510] Thread::call_run()+0xac (thread.cpp:231) > V [libjvm.so+0x13b3994] thread_native_entry(Thread*)+0x130 (os_linux.cpp:877) > C [libpthread.so.0+0x875c] start_thread+0x18c > ``` I will try to reproduce. Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2697052899 From tschatzl at openjdk.org Tue Mar 4 10:53:46 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 10:53:46 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v10] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). 
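To make the instruction-count comparison above concrete, here is a minimal stand-alone C++ sketch of the two barrier shapes, assuming a plain one-byte-per-512-byte-card table. This is a toy model with invented names and encodings, not the barrier code the compilers actually emit.

```cpp
// Toy model contrasting the two post-write barrier shapes discussed above.
// Illustrative only: encodings, sizes and helper functions are invented here.
#include <atomic>
#include <cstddef>
#include <cstdint>

using CardValue = uint8_t;

static const int       card_shift = 9;     // 512-byte cards, as in HotSpot
static const CardValue dirty_card = 0;     // placeholder encodings
static const CardValue clean_card = 0xff;

static char      heap[1 << 20];                        // 1 MiB toy "heap"
static CardValue card_table[(1 << 20) >> card_shift];  // one byte per card

static CardValue* card_for(void* addr) {
  uintptr_t offset = (uintptr_t)addr - (uintptr_t)heap;
  return &card_table[offset >> card_shift];
}

// Parallel/Serial-style barrier: a single unconditional store. This is roughly
// the shape the change described here moves the G1 post-write barrier towards.
static void post_write_barrier_simple(void* field) {
  *card_for(field) = dirty_card;
}

// Stand-ins for the filters and the enqueue used by the current G1 barrier.
static bool same_region(void*, void*) { return false; }  // "same region check"
static bool in_young_gen(CardValue*)  { return false; }  // "write to young gen check"
static void enqueue_card(CardValue*)  {}                 // dirty card queue hand-off

// Shape of the current G1 post-write barrier from the pseudo code above.
static void post_write_barrier_g1_current(void* field, void* new_value) {
  if (same_region(field, new_value)) return;
  if (new_value == nullptr) return;
  CardValue* card = card_for(field);
  if (in_young_gen(card)) return;
  std::atomic_thread_fence(std::memory_order_seq_cst);   // StoreLoad
  if (*card == dirty_card) return;
  *card = dirty_card;
  enqueue_card(card);                                     // card tracking
}

int main() {
  for (size_t i = 0; i < sizeof(card_table); i++) card_table[i] = clean_card;
  post_write_barrier_simple(&heap[4096]);
  post_write_barrier_g1_current(&heap[8192], &heap[0]);
  return 0;
}
```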
> > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * ayang review - fix comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/442d9eae..fc674f02 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=08-09 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Tue Mar 4 11:14:01 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Mar 2025 11:14:01 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 09:53:21 GMT, Andrew Dinn wrote: > Oops. sorry - cut and paste error -- the new setting should be > > ``` > do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) > ``` @adinn, I have done this change, but that erased your approval. Could you reapprove? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697145316 From iwalulya at openjdk.org Tue Mar 4 11:19:59 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Tue, 4 Mar 2025 11:19:59 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 09:57:56 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * iwalulya review 2 > * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState > * some additional documentation src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 108: > 106: > 107: void G1ConcurrentRefineThreadControl::control_thread_do(ThreadClosure* tc) { > 108: if (_control_thread != nullptr) { maybe maintain using `if (max_num_threads() > 0)` as used in `G1ConcurrentRefineThreadControl::initialize`, so that it is clear that setting `G1ConcRefinementThreads=0` effectively turns off concurrent refinement. src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 354: > 352: if (!r->is_free()) { > 353: // Need to scan all parts of non-free regions, so reset the claim. > 354: // No need for synchronization: we are only interested about regions s/about/in src/hotspot/share/gc/g1/g1OopClosures.hpp line 205: > 203: G1CollectedHeap* _g1h; > 204: uint _worker_id; > 205: bool _has_to_cset_ref; Similar to `_cards_refer_to_cset` , do you mind renaming `_has_to_cset_ref` and `_has_to_old_ref` to `_has_ref_to_cset` and `_has_ref_to_old` src/hotspot/share/gc/g1/g1Policy.hpp line 105: > 103: uint _free_regions_at_end_of_collection; > 104: > 105: size_t _pending_cards_from_gc; A comment on the variable would be nice, especially on how it is set/reset both at end of GC and by refinement. 
Also the `_to_collection_set_cards` below could use a comment ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979077904 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979102189 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979212854 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979155941 From adinn at openjdk.org Tue Mar 4 11:21:00 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 4 Mar 2025 11:21:00 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 06:22:09 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Merged master. > - Added more comments, mainly as suggested by Andrew Dinn > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions > - Adding comments + some code reorganization > - removed debugging code > - merging master > - ... and 3 more: https://git.openjdk.org/jdk/compare/ab4b0ef9...d82dfb2f Still good. ------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23300#pullrequestreview-2657047714 From tschatzl at openjdk.org Tue Mar 4 11:39:55 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 11:39:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 10:06:37 GMT, Ivan Walulya wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * iwalulya review 2 >> * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState >> * some additional documentation > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 108: > >> 106: >> 107: void G1ConcurrentRefineThreadControl::control_thread_do(ThreadClosure* tc) { >> 108: if (_control_thread != nullptr) { > > maybe maintain using `if (max_num_threads() > 0)` as used in `G1ConcurrentRefineThreadControl::initialize`, so that it is clear that setting `G1ConcRefinementThreads=0` effectively turns off concurrent refinement. I added a new `is_refinement_enabled()` predicate instead (that uses `max_num_threads()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979252156 From shade at openjdk.org Tue Mar 4 11:51:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 4 Mar 2025 11:51:07 GMT Subject: RFR: 8345169: Implement JEP XXX: Remove the 32-bit x86 Port In-Reply-To: References: Message-ID: On Thu, 5 Dec 2024 08:26:10 GMT, Aleksey Shipilev wrote: > **NOTE: This is work-in-progress draft for interested parties. The JEP is not even submitted, let alone targeted.** > > My plan is to to get this done in a quiet time in mainline to limit the ongoing conflicts with mainline. Feel free to comment in this PR, if you see something ahead of time. These comments might adjust the trajectory we take to implement this removal and/or allows us submit and work out more RFEs ahead of this removal. 
I plan to re-open a clean PR after this preliminary PR is done, maybe after the round of preliminary reviews. > > This removes the 32-bit x86 port and does a deeper cleaning in Hotspot. The following paragraphs describe what and why was being done. > > Easy stuff first: all files named `*_x86_32` are gone. Those are only built when build system knows we are compiling for x86_32. There is therefore no impact on x86_64. > > The code under `!LP64`, `!AMD64` and `IA32` is removed in `x86`-specific files. There is quite a bit of the code, especially around `Assembler` and `MacroAssembler`. I think these removals make the whole thing cleaner. The downside is that some of the `MacroAssembler::*ptr` functions that were used to select the "machine pointer" instructions either from x86_64 or x86_32 are now exclusively for x86_64. I don't think we want to rewrite `*ptr` -> `*q` at this point. I think we gradually morph the code base to use `*q`-flavored methods in new code. > > x86_32 is the only platform that has special cases for x87 FPU. > > C1 even implements the whole separate thing to deal with x87 FPU: the parts of regalloc treat it specially, there is `FpuStackSim`, there is `VerifyFPU` family of flags, etc. There are also peculiarities with FP conversions that use FPU, that's why x86_32 used to have template interpreter stubs for FP conversion methods. None of that is needed anymore without x86_32. This cleans up some arch-specific code as well. > > Both C1 and C2 implement the workarounds for non-IEEE compliant rounding of x87 FPU. After x86_32 is gone, these are not needed anymore. This removes some C2 nodes, removes the rounding instructions in C1. > > x86_64 is baselined on SSE2+, the VM would not even start if SSE2 is not supported. Most of the checks that we have for `UseSSE < 2` are for the benefit of x86_32. Because of this I folded redundant `UseSSE` checks around Hotspot. > > The one thing I _deliberately_ avoided doing is merging `x86.ad` and `x86_64.ad`. It would likely introduce uncomfortable amount of conflicts with pending work in mainli... Great, thanks for the feedback. I think we are going to go with the JEP implementation that removes the easy parts of x86_32 code, and then do the deeper cleanups under [JDK-8351148](https://bugs.openjdk.org/browse/JDK-8351148) umbrella. I added some subtasks there, based on the commits from this bulk PR. I am closing this PR in favor of about-to-be-created cleaner PR for JEP 503. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22567#issuecomment-2697266596 From shade at openjdk.org Tue Mar 4 11:51:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 4 Mar 2025 11:51:07 GMT Subject: Withdrawn: 8345169: Implement JEP XXX: Remove the 32-bit x86 Port In-Reply-To: References: Message-ID: On Thu, 5 Dec 2024 08:26:10 GMT, Aleksey Shipilev wrote: > **NOTE: This is work-in-progress draft for interested parties. The JEP is not even submitted, let alone targeted.** > > My plan is to to get this done in a quiet time in mainline to limit the ongoing conflicts with mainline. Feel free to comment in this PR, if you see something ahead of time. These comments might adjust the trajectory we take to implement this removal and/or allows us submit and work out more RFEs ahead of this removal. I plan to re-open a clean PR after this preliminary PR is done, maybe after the round of preliminary reviews. > > This removes the 32-bit x86 port and does a deeper cleaning in Hotspot. The following paragraphs describe what and why was being done. 
> > Easy stuff first: all files named `*_x86_32` are gone. Those are only built when build system knows we are compiling for x86_32. There is therefore no impact on x86_64. > > The code under `!LP64`, `!AMD64` and `IA32` is removed in `x86`-specific files. There is quite a bit of the code, especially around `Assembler` and `MacroAssembler`. I think these removals make the whole thing cleaner. The downside is that some of the `MacroAssembler::*ptr` functions that were used to select the "machine pointer" instructions either from x86_64 or x86_32 are now exclusively for x86_64. I don't think we want to rewrite `*ptr` -> `*q` at this point. I think we gradually morph the code base to use `*q`-flavored methods in new code. > > x86_32 is the only platform that has special cases for x87 FPU. > > C1 even implements the whole separate thing to deal with x87 FPU: the parts of regalloc treat it specially, there is `FpuStackSim`, there is `VerifyFPU` family of flags, etc. There are also peculiarities with FP conversions that use FPU, that's why x86_32 used to have template interpreter stubs for FP conversion methods. None of that is needed anymore without x86_32. This cleans up some arch-specific code as well. > > Both C1 and C2 implement the workarounds for non-IEEE compliant rounding of x87 FPU. After x86_32 is gone, these are not needed anymore. This removes some C2 nodes, removes the rounding instructions in C1. > > x86_64 is baselined on SSE2+, the VM would not even start if SSE2 is not supported. Most of the checks that we have for `UseSSE < 2` are for the benefit of x86_32. Because of this I folded redundant `UseSSE` checks around Hotspot. > > The one thing I _deliberately_ avoided doing is merging `x86.ad` and `x86_64.ad`. It would likely introduce uncomfortable amount of conflicts with pending work in mainli... This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/22567 From tschatzl at openjdk.org Tue Mar 4 11:56:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 11:56:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
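The queue-based tracking in the bullets above is what this change replaces with two card tables that are switched atomically (see the "main idea" paragraph further down in this description). Below is a toy model of that coarse-grained hand-over, with invented names and with the thread synchronization around the switch deliberately omitted.

```cpp
// Toy model of the coarse-grained scheme: mutators dirty whichever card table
// is currently "primary", refinement retires that table and sweeps it.
// Illustrative only; names, types and the switch protocol are invented.
#include <atomic>
#include <cstddef>
#include <cstdint>

using CardValue = uint8_t;
static const size_t    num_cards  = 2048;
static const CardValue clean_card = 0xff;
static const CardValue dirty_card = 0;

static CardValue table_a[num_cards];
static CardValue table_b[num_cards];

// Mutators only ever load this pointer and store into the table it refers to.
static std::atomic<CardValue*> primary_table{table_a};

// Post-write barrier: dirty the card in the current primary table.
static void post_write_barrier(size_t card_index) {
  primary_table.load(std::memory_order_relaxed)[card_index] = dirty_card;
}

// Refinement control: retire the current primary table and make the other
// (already cleaned) table the new primary one. The retired table can then be
// swept without fine-grained synchronization with the mutators.
static CardValue* switch_card_tables() {
  CardValue* retired = primary_table.load(std::memory_order_relaxed);
  CardValue* fresh = (retired == table_a) ? table_b : table_a;
  primary_table.store(fresh, std::memory_order_seq_cst);
  // The VM additionally makes all threads agree on the new table (e.g. via a
  // handshake) before sweeping starts; that step is omitted in this sketch.
  return retired;
}

// Sweep the retired table: process dirty cards and clean them again.
static size_t sweep(CardValue* table) {
  size_t refined = 0;
  for (size_t i = 0; i < num_cards; i++) {
    if (table[i] == dirty_card) {
      // ... re-examine the heap range covered by this card here ...
      table[i] = clean_card;
      refined++;
    }
  }
  return refined;
}

int main() {
  for (size_t i = 0; i < num_cards; i++) { table_a[i] = clean_card; table_b[i] = clean_card; }
  post_write_barrier(10);
  post_write_barrier(42);
  CardValue* retired = switch_card_tables();
  return sweep(retired) == 2 ? 0 : 1;
}
```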
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: iwalulya review * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement * predicate for determining whether the refinement has been disabled * some other typos/comment improvements * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/fc674f02..b4d19d9b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=09-10 Stats: 40 lines in 8 files changed: 14 ins; 0 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From coleenp at openjdk.org Tue Mar 4 13:30:04 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 4 Mar 2025 13:30:04 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Mon, 3 Mar 2025 23:18:13 GMT, Patricio Chilano Mateo wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 1359: > >> 1357: // Build the doubly linked list to get hold of currentNode->prev(). >> 1358: _entry_list_tail = nullptr; >> 1359: entry_list_tail(current); > > I think we should try to avoid having to rebuild the doubly link list from scratch, since only a few nodes in the front might be missing the previous links. For platform threads it might not matter that much, but for virtual threads this list could be much larger. Maybe we can leave it as a future enhancement. We don't have a prev node, we don't know which node to set next to our next node to. 
The list will be broken. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1979432912 From adinn at openjdk.org Tue Mar 4 14:04:03 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 4 Mar 2025 14:04:03 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 11:11:44 GMT, Ferenc Rakoczi wrote: >> Oops. sorry - cut and paste error -- the new setting should be >> >> do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) > >> Oops. sorry - cut and paste error -- the new setting should be >> >> ``` >> do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) >> ``` > > @adinn, I have done this change, but that erased your approval. Could you reapprove? @ferakocz Feel free to integrate and I will sponsor ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697719261 From duke at openjdk.org Tue Mar 4 14:13:05 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Mar 2025 14:13:05 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 11:11:44 GMT, Ferenc Rakoczi wrote: >> Oops. sorry - cut and paste error -- the new setting should be >> >> do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) > >> Oops. sorry - cut and paste error -- the new setting should be >> >> ``` >> do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) >> ``` > > @adinn, I have done this change, but that erased your approval. Could you reapprove? > @ferakocz Feel free to integrate and I will sponsor @adinn thanks a lot for the review and the sponsoring, too! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697761033 From duke at openjdk.org Tue Mar 4 14:13:05 2025 From: duke at openjdk.org (duke) Date: Tue, 4 Mar 2025 14:13:05 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: <4goExO2NlWn1wVnu0eYddpXAN4h_t9F7VG4b-MHI_sE=.74de8ba0-eec5-401e-9aa5-6bda6a4e74a5@github.com> On Fri, 28 Feb 2025 06:22:09 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Merged master. > - Added more comments, mainly as suggested by Andrew Dinn > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions > - Adding comments + some code reorganization > - removed debugging code > - merging master > - ... and 3 more: https://git.openjdk.org/jdk/compare/ab4b0ef9...d82dfb2f @ferakocz Your change (at version d82dfb2f6d329f4caa0949bfbcd5dd5e5d52d6e9) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697751091 From mullan at openjdk.org Tue Mar 4 14:36:04 2025 From: mullan at openjdk.org (Sean Mullan) Date: Tue, 4 Mar 2025 14:36:04 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 06:22:09 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Merged master. > - Added more comments, mainly as suggested by Andrew Dinn > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions > - Adding comments + some code reorganization > - removed debugging code > - merging master > - ... and 3 more: https://git.openjdk.org/jdk/compare/ab4b0ef9...d82dfb2f I think it would be nice to add a release note for this describing the approximate performance improvement. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2697841749 From duke at openjdk.org Tue Mar 4 14:44:00 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Mar 2025 14:44:00 GMT Subject: Integrated: 8348561: Add aarch64 intrinsics for ML-DSA In-Reply-To: References: Message-ID: On Fri, 24 Jan 2025 14:24:23 GMT, Ferenc Rakoczi wrote: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. This pull request has now been integrated. Changeset: 3230894b Author: Ferenc Rakoczi Committer: Andrew Dinn URL: https://git.openjdk.org/jdk/commit/3230894bdd8ab4183b83ad4c942eb6acad4acce6 Stats: 2611 lines in 22 files changed: 2030 ins; 92 del; 489 mod 8348561: Add aarch64 intrinsics for ML-DSA Reviewed-by: adinn ------------- PR: https://git.openjdk.org/jdk/pull/23300 From ayang at openjdk.org Tue Mar 4 15:47:00 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Tue, 4 Mar 2025 15:47:00 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 11:56:56 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > iwalulya review > * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement > * predicate for determining whether the refinement has been disabled > * some other typos/comment improvements > * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 356: > 354: bool do_heap_region(G1HeapRegion* r) override { > 355: if (!r->is_free()) { > 356: // Need to scan all parts of non-free regions, so reset the claim. Why is the condition "is_free"? I thought we scan only old-or-humongous regions? src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: > 114: SwapGlobalCT, // Swap global card table. > 115: SwapJavaThreadsCT, // Swap java thread's card tables. > 116: SwapGCThreadsCT, // Swap GC thread's card tables. Do GC threads have card-table? src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219: > 217: // The young gen revising mechanism reads the predictor and the values set > 218: // here. Avoid inconsistencies by locking. > 219: MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag); Who else can be in this critical-section? I don't get what this lock is protecting us from. src/hotspot/share/gc/g1/g1ConcurrentRefineThread.hpp line 83: > 81: > 82: public: > 83: static G1ConcurrentRefineThread* create(G1ConcurrentRefine* cr); I wonder if the comment for this class "One or more G1 Concurrent Refinement Threads..." has become obsolete. (AFAICS, this class is a singleton.) src/hotspot/share/gc/g1/g1ConcurrentRefineWorkTask.cpp line 69: > 67: } else if (res == G1RemSet::NoInteresting) { > 68: _refine_stats.inc_cards_clean_again(); > 69: } A `switch` is probably cleaner. src/hotspot/share/gc/g1/g1ConcurrentRefineWorkTask.cpp line 78: > 76: do_dirty_card(source, dest_card); > 77: } > 78: return pointer_delta(dirty_r, dirty_l, sizeof(CardValue)); I feel the `pointer_delta` line belongs to the caller. After that, even the entire method can be inlined to the caller. YMMV. 
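For the last two comments, something along the following lines is presumably what is meant. Apart from `NoInteresting`, which appears in the quoted snippet, every name below (enumerators, stats fields, helpers) is an invented placeholder rather than the patch's actual code.

```cpp
// Sketch of the two suggestions: a switch over the per-card refinement result,
// and letting the caller compute the number of refined cards itself.
#include <cstddef>
#include <cstdint>

using CardValue = uint8_t;

enum class RefineResult { HasRefToCSet, HasRefToOld, NoInteresting };  // placeholder names

struct RefineStats {                 // placeholder for the real statistics object
  size_t cards_to_cset = 0;
  size_t cards_to_old = 0;
  size_t cards_clean_again = 0;
};

// Stand-in for the real per-card refinement; always finds nothing interesting.
static RefineResult refine_card(CardValue*) { return RefineResult::NoInteresting; }

// if/else-if chain replaced by a switch: an unhandled result value becomes a
// compiler warning instead of being silently ignored.
static void do_dirty_card(RefineStats& stats, CardValue* card) {
  switch (refine_card(card)) {
    case RefineResult::HasRefToCSet:  stats.cards_to_cset++;      break;
    case RefineResult::HasRefToOld:   stats.cards_to_old++;       break;
    case RefineResult::NoInteresting: stats.cards_clean_again++;  break;
  }
}

// The helper only sweeps the dirty range...
static void sweep_dirty_range(RefineStats& stats, CardValue* dirty_l, CardValue* dirty_r) {
  for (CardValue* card = dirty_l; card < dirty_r; ++card) {
    do_dirty_card(stats, card);
  }
}

// ...and the caller, which already knows [dirty_l, dirty_r), counts the
// refined cards itself (the "pointer_delta belongs to the caller" point).
static size_t sweep_and_count(RefineStats& stats, CardValue* dirty_l, CardValue* dirty_r) {
  sweep_dirty_range(stats, dirty_l, dirty_r);
  return static_cast<size_t>(dirty_r - dirty_l);
}

int main() {
  CardValue cards[16] = {};
  RefineStats stats;
  size_t refined = sweep_and_count(stats, cards, cards + 16);
  return (refined == 16 && stats.cards_clean_again == 16) ? 0 : 1;
}
```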
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979666477 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979678325 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979699376 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979695999 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979705019 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979709682 From tschatzl at openjdk.org Tue Mar 4 16:03:55 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:03:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 15:16:17 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 356: > >> 354: bool do_heap_region(G1HeapRegion* r) override { >> 355: if (!r->is_free()) { >> 356: // Need to scan all parts of non-free regions, so reset the claim. > > Why is the condition "is_free"? I thought we scan only old-or-humongous regions? We also need to clear young gen region marks because we want them to be all clean in the card table for the garbage collection (evacuation failure handling, use in next cycle). This is maybe a bit of a waste if there are multiple refinement rounds between two gcs, but it's less expensive than in the pause wrt to latency. It's fast anyway. > src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: > >> 114: SwapGlobalCT, // Swap global card table. >> 115: SwapJavaThreadsCT, // Swap java thread's card tables. >> 116: SwapGCThreadsCT, // Swap GC thread's card tables. > > Do GC threads have card-table? Hmm, I thought I changed tat already just recently with Ivan's latest requests. Will fix. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979742662 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979752692 From tschatzl at openjdk.org Tue Mar 4 16:07:58 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:07:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 15:33:29 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming > > src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219: > >> 217: // The young gen revising mechanism reads the predictor and the values set >> 218: // here. Avoid inconsistencies by locking. 
>> 219: MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag); > > Who else can be in this critical-section? I don't get what this lock is protecting us from. The concurrent refine control thread in `G1ConcurrentRefineThread::do_refinement`, when calling `G1Policy::record_dirtying_stats`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979759329 From tschatzl at openjdk.org Tue Mar 4 16:07:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:07:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 16:00:46 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: >> >>> 114: SwapGlobalCT, // Swap global card table. >>> 115: SwapJavaThreadsCT, // Swap java thread's card tables. >>> 116: SwapGCThreadsCT, // Swap GC thread's card tables. >> >> Do GC threads have card-table? > > Hmm, I thought I changed tat already just recently with Ivan's latest requests. Will fix. Oh, I only fixed the string. Apologies. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979761737 From tschatzl at openjdk.org Tue Mar 4 16:20:58 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:20:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 15:56:05 GMT, Thomas Schatzl wrote: > It's fast anyway. To clarify: If you have multiple refinement rounds between two garbage collections, the time to clear the young gen cards is almost noise compared to the actual refinement effort. Like two magnitudes faster. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979785011 From tschatzl at openjdk.org Tue Mar 4 16:34:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 16:34:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: <3LR5VKMhSuXWmMlphpe8SLHm8vQQt6j343qaO61S_mQ=.dc1d2e4a-c858-44bd-9da0-f3f98340d939@github.com> On Tue, 4 Mar 2025 16:04:00 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219: >> >>> 217: // The young gen revising mechanism reads the predictor and the values set >>> 218: // here. Avoid inconsistencies by locking. >>> 219: MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag); >> >> Who else can be in this critical-section? I don't get what this lock is protecting us from. > > The concurrent refine control thread in `G1ConcurrentRefineThread::do_refinement`, when calling `G1Policy::record_dirtying_stats`. I could create an extra mutex for that if you want to make it clear which two parties access the same data. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1979810144 From tschatzl at openjdk.org Tue Mar 4 17:20:28 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 4 Mar 2025 17:20:28 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v12] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. 
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... 
Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: ayang review * renamings * refactorings ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/b4d19d9b..4a978118 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=10-11 Stats: 34 lines in 4 files changed: 13 ins; 1 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From pchilanomate at openjdk.org Tue Mar 4 17:38:58 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Tue, 4 Mar 2025 17:38:58 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> On Tue, 4 Mar 2025 04:50:34 GMT, David Holmes wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 204: >> >>> 202: // If the thread (F) that removes itself from the end of the list >>> 203: // hasn't got any prev pointer, we just set the tail pointer to >>> 204: // null, see 5) and 6) below. >> >> Setting the tail pointer to null would be for the case when this node is also the head, i.e single element. Otherwise we just rebuild the doubly link list, unlink F, and set entry_list_tail to G. In other words, the comment here and below seems to be missing that we have to build the doubly link list when F acquires the monitor, not when F needs to find a successor. > > We don't rebuild at this point. The thread that is removing itself just sets tail to null if there is no prev. Later when F exits the monitor it will construct the DLL to find the next successor. But if there is a previous node (just no previous pointer set) we have to rebuild the list, otherwise G would still be pointing to F. It would be this case: https://github.com/fbredber/jdk/blob/283c2431ec64b0865d4e678913c636732d01658f/src/hotspot/share/runtime/objectMonitor.cpp#L1313 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1979921706 From fbredberg at openjdk.org Tue Mar 4 18:12:59 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Tue, 4 Mar 2025 18:12:59 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> Message-ID: On Tue, 4 Mar 2025 17:36:43 GMT, Patricio Chilano Mateo wrote: >> We don't rebuild at this point. The thread that is removing itself just sets tail to null if there is no prev. Later when F exits the monitor it will construct the DLL to find the next successor. > > But if there is a previous node (just no previous pointer set) we have to rebuild the list, otherwise G would still be pointing to F. It would be this case: https://github.com/fbredber/jdk/blob/283c2431ec64b0865d4e678913c636732d01658f/src/hotspot/share/runtime/objectMonitor.cpp#L1313 You're quite right. I'll rewrite that section of the comment. Thank you for spotting this. 
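A small stand-alone model of the entry list being discussed may help follow the F/G example; it is purely illustrative and not the ObjectMonitor code, and it leaves out node removal, which is what the exchange above is about. Nodes are pushed at the head with only next links, prev links are filled in lazily by walking from the head, and, as suggested later in the thread, the walk can stop at the first node whose prev link is already set.

```cpp
// Toy model of an "entry list" where nodes are pushed at the head with only
// next links, and prev links plus the tail are established lazily.
// Illustrative only; this is not the HotSpot ObjectMonitor implementation.
#include <cassert>

struct Node {
  Node* next = nullptr;
  Node* prev = nullptr;   // filled in lazily, points towards the head
  int id;
  explicit Node(int i) : id(i) {}
};

struct EntryList {
  Node* head = nullptr;
  Node* tail = nullptr;   // only valid once prev links have been built up to it

  // New waiters are pushed at the head (in the VM this is a CAS, not a plain store).
  void push(Node* n) {
    n->next = head;
    head = n;
  }

  // Build prev links walking from the head towards the tail. The walk can stop
  // at the first node whose prev link is already non-null: nodes are only ever
  // added at the head, so a prev link built earlier is still correct.
  Node* build_prev_links_and_get_tail() {
    Node* predecessor = nullptr;
    for (Node* cur = head; cur != nullptr; cur = cur->next) {
      if (cur->prev != nullptr) {
        break;                    // rest of the list (and the cached tail) is intact
      }
      cur->prev = predecessor;
      predecessor = cur;
      if (cur->next == nullptr) {
        tail = cur;               // reached the end, remember the tail
      }
    }
    return tail;
  }
};

int main() {
  EntryList l;
  Node a(1), b(2), c(3);
  l.push(&a);                           // list: a
  l.build_prev_links_and_get_tail();    // a.prev == nullptr, tail == &a
  l.push(&b);                           // list: b -> a
  l.push(&c);                           // list: c -> b -> a, c and b lack prev links
  Node* t = l.build_prev_links_and_get_tail();
  assert(t == &a && a.prev == &b && b.prev == &c && c.prev == nullptr);
  return 0;
}
```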
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1979966484 From pchilanomate at openjdk.org Tue Mar 4 18:13:00 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Tue, 4 Mar 2025 18:13:00 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Tue, 4 Mar 2025 13:27:17 GMT, Coleen Phillimore wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 1359: >> >>> 1357: // Build the doubly linked list to get hold of currentNode->prev(). >>> 1358: _entry_list_tail = nullptr; >>> 1359: entry_list_tail(current); >> >> I think we should try to avoid having to rebuild the doubly link list from scratch, since only a few nodes in the front might be missing the previous links. For platform threads it might not matter that much, but for virtual threads this list could be much larger. Maybe we can leave it as a future enhancement. > > We don't have a prev node, we don't know which node to set next to our next node to. The list will be broken. Right, we still have to set the previous links for those nodes. I'm just suggesting we don't have to walk the whole list, just until the last node we set the previous pointer. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1979963352 From dlong at openjdk.org Tue Mar 4 18:14:00 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 4 Mar 2025 18:14:00 GMT Subject: RFR: 8351036: [JVMCI] value not an s2: -32776 In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 09:18:40 GMT, Doug Simon wrote: > This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23888#pullrequestreview-2658525441 From mpowers at openjdk.org Tue Mar 4 19:28:02 2025 From: mpowers at openjdk.org (Mark Powers) Date: Tue, 4 Mar 2025 19:28:02 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v2] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 19:00:59 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added comments, removed debugging printfs

ML-DSA benchmark results for this PR

keygen ML-DSA-44 96 us/op
keygen ML-DSA-65 200 us/op
keygen ML-DSA-87 272 us/op
siggen ML-DSA-44 297 us/op
siggen ML-DSA-65 452 us/op
siggen ML-DSA-87 728 us/op
sigver ML-DSA-44 115 us/op
sigver ML-DSA-65 176 us/op
sigver ML-DSA-87 290 us/op

ML-DSA no intrinsics

keygen ML-DSA-44 169 us/op
keygen ML-DSA-65 302 us/op
keygen ML-DSA-87 444 us/op
siggen ML-DSA-44 696 us/op
siggen ML-DSA-65 1114 us/op
siggen ML-DSA-87 1828 us/op
sigver ML-DSA-44 187 us/op
sigver ML-DSA-65 295 us/op
sigver ML-DSA-87 473 us/op

------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2698691038 From dnsimon at openjdk.org Tue Mar 4 20:14:03 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Mar 2025 20:14:03 GMT Subject: RFR: 8351036: [JVMCI] value not an s2: -32776 In-Reply-To: References: Message-ID: <-tI6hRLLVFZKckI0dXweArTpvkkuppQ-UCe7QCP204M=.7071b95a-b2bd-43fd-8593-47a3e0711a98@github.com> On Tue, 4 Mar 2025 09:18:40 GMT, Doug Simon wrote: > This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. Thanks for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23888#issuecomment-2698788687 From dnsimon at openjdk.org Tue Mar 4 20:14:04 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Mar 2025 20:14:04 GMT Subject: Integrated: 8351036: [JVMCI] value not an s2: -32776 In-Reply-To: References: Message-ID: <13cAPTn_ilQ-6cQXLy7mta5wV4zczVRsYdpRe5RqnWw=.be50128c-0f43-4bca-8cd8-5a01b51b1c34@github.com> On Tue, 4 Mar 2025 09:18:40 GMT, Doug Simon wrote: > This PR adds support for JVMCI to install code that requires stack slots whose offset > `Short.MAX_VALUE`. This pull request has now been integrated. Changeset: a21302bb Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/a21302bb3244b85dd9809c42d1c0fd502bd677cc Stats: 44 lines in 4 files changed: 36 ins; 0 del; 8 mod 8351036: [JVMCI] value not an s2: -32776 Reviewed-by: yzheng, dlong ------------- PR: https://git.openjdk.org/jdk/pull/23888 From duke at openjdk.org Tue Mar 4 22:04:26 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Mar 2025 22:04:26 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v4] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Fixed mismerge. - Merged master.
- A little cleanup - Merged master - removing trailing spaces - kyber aarch64 intrinsics ------------- Changes: https://git.openjdk.org/jdk/pull/23663/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=03 Stats: 2508 lines in 18 files changed: 2464 ins; 16 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From dholmes at openjdk.org Wed Mar 5 05:17:55 2025 From: dholmes at openjdk.org (David Holmes) Date: Wed, 5 Mar 2025 05:17:55 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> Message-ID: On Tue, 4 Mar 2025 18:09:56 GMT, Fredrik Bredberg wrote: >> But if there is a previous node (just no previous pointer set) we have to rebuild the list, otherwise G would still be pointing to F. It would be this case: https://github.com/fbredber/jdk/blob/283c2431ec64b0865d4e678913c636732d01658f/src/hotspot/share/runtime/objectMonitor.cpp#L1313 > > You're quite right. I'll rewrite that section of the comment. Thank you for spotting this. Yep my bad - you can't delete yourself without a prev node pointer when you are being pointed to. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1980709561 From tschatzl at openjdk.org Wed Mar 5 09:45:00 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 5 Mar 2025 09:45:00 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v13] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * fix whitespace * additional whitespace between log tags * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/4a978118..a457e6e7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=11-12 Stats: 116 lines in 6 files changed: 50 ins; 50 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From iwalulya at openjdk.org Wed Mar 5 11:12:56 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Wed, 5 Mar 2025 11:12:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v13] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 09:45:00 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix whitespace > * additional whitespace between log tags > * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 32: > 30: #include "gc/g1/g1HeapRegion.hpp" > 31: #include "gc/g1/g1ThreadLocalData.hpp" > 32: #include "utilities/macros.hpp" Suggestion: #include "utilities/formatBuffer.hpp" #include "utilities/macros.hpp" to use `err_msg` src/hotspot/share/gc/g1/g1RemSet.cpp line 90: > 88: // contiguous ranges of dirty cards to be scanned. These blocks are converted to actual > 89: // memory ranges and then passed on to actual scanning. > 90: class G1RemSetScanState : public CHeapObj { Need to update the comment above to remove reference to "log buffers" (L:67). 
src/hotspot/share/gc/g1/g1RemSet.hpp line 44: > 42: class CardTableBarrierSet; > 43: class G1AbstractSubTask; > 44: class G1RemSetScanState; Already declared on line 48 below src/hotspot/share/gc/g1/g1ThreadLocalData.hpp line 29: > 27: #include "gc/g1/g1BarrierSet.hpp" > 28: #include "gc/g1/g1CardTable.hpp" > 29: #include "gc/g1/g1CollectedHeap.hpp" probably does not need to be included ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1981138746 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1981162792 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1981118865 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1981142943 From iwalulya at openjdk.org Wed Mar 5 11:12:58 2025 From: iwalulya at openjdk.org (Ivan Walulya) Date: Wed, 5 Mar 2025 11:12:58 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v12] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 17:20:28 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). 
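For comparison, the Parallel/Serial-style post-write barrier mentioned in the quoted description amounts to a single unconditional card mark. The following is only a conceptual Java sketch of that scheme, not JDK code: the real barrier is emitted by the JIT as a few machine instructions, and the card size, table layout and names below are illustrative assumptions.

    class CardMarkSketch {
        // Assumed card geometry: one card-table byte per 512-byte card.
        static final int CARD_SHIFT = 9;
        static final byte DIRTY = 0;
        static final byte[] CARD_TABLE = new byte[1 << 20]; // toy table size
        static final long HEAP_BASE = 0L;                    // assumed heap start

        // Parallel-style post-write barrier: unconditionally dirty the card
        // covering the updated field. No filtering, no StoreLoad, no queueing.
        static void postWriteBarrier(long fieldAddress) {
            CARD_TABLE[(int) ((fieldAddress - HEAP_BASE) >>> CARD_SHIFT)] = DIRTY;
        }
    }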
>> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > ayang review > * renamings > * refactorings src/hotspot/share/gc/g1/g1HeapRegion.hpp line 475: > 473: void hr_clear(bool clear_space); > 474: // Clear the card table corresponding to this region. > 475: void clear_cardtable(); in some places `cardtable()` has been refactored to `card_table` e.g. in G1HeapRegionManager. src/hotspot/share/gc/g1/g1ParScanThreadState.hpp line 67: > 65: > 66: size_t _num_marked_as_dirty_cards; > 67: size_t _num_marked_as_into_cset_cards; Suggestion: size_t _num_cards_marked_dirty; size_t _num_cards_marked_to_cset; ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1980117641 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1980145229 From duke at openjdk.org Wed Mar 5 11:33:06 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 5 Mar 2025 11:33:06 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: - Merged master. - Added comments, removed debugging printfs - JDK-8351034 Add AVX-512 intrinsics for ML-DSA ------------- Changes: https://git.openjdk.org/jdk/pull/23860/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=02 Stats: 1642 lines in 8 files changed: 1636 ins; 2 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From jbhateja at openjdk.org Wed Mar 5 11:38:52 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 5 Mar 2025 11:38:52 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v2] In-Reply-To: References: Message-ID: On Mon, 3 Mar 2025 19:00:59 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added comments, removed debugging printfs src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 420: > 418: __ movptr(constant2use, round_consts); > 419: > 420: __ BIND(rounds24_loop); For Icache alignment, please use __ align64() before the loop entry. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1978822704 From jbhateja at openjdk.org Wed Mar 5 11:42:01 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 5 Mar 2025 11:42:01 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: Message-ID: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> On Wed, 5 Mar 2025 11:33:06 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Merged master. > - Added comments, removed debugging printfs > - JDK-8351034 Add AVX-512 intrinsics for ML-DSA src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 292: > 290: __ movl(iterations, 2); > 291: > 292: __ BIND(L_loop); Please align loop entry address using __align64(). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1981242267 From fbredberg at openjdk.org Wed Mar 5 12:31:23 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:31:23 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. > > However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: Updated comments after review by Patricio.
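To make the push/walk scheme described above concrete, here is a small self-contained Java sketch of the two operations: contending threads CAS themselves onto the head, and the exiting owner walks from the head filling in the prev pointers to locate the FIFO tail. It uses java.util.concurrent.atomic.AtomicReference and a toy Waiter class purely for illustration; the real code is the C++ ObjectMonitor implementation in the patch, and the names below are not taken from it.

    import java.util.concurrent.atomic.AtomicReference;

    // Toy stand-in for the C++ ObjectWaiter node; illustrative only.
    final class Waiter {
        volatile Waiter next;
        volatile Waiter prev;
    }

    final class EntryListSketch {
        private final AtomicReference<Waiter> entryList = new AtomicReference<>();
        private Waiter entryListTail; // cached tail, touched only by the lock owner

        // Contending thread: push itself onto the head with a CAS retry loop.
        void pushToEntryList(Waiter self) {
            Waiter head;
            do {
                head = entryList.get();
                self.next = head;
                self.prev = null;
            } while (!entryList.compareAndSet(head, self));
        }

        // Exiting owner: walk from the head, filling in prev pointers, so the
        // FIFO successor (the tail) can be found. Only the owner does this.
        Waiter findTail() {
            if (entryListTail != null) {
                return entryListTail;
            }
            Waiter prev = null;
            for (Waiter w = entryList.get(); w != null; w = w.next) {
                w.prev = prev;
                prev = w;
            }
            entryListTail = prev;
            return prev;
        }
    }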
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23421/files - new: https://git.openjdk.org/jdk/pull/23421/files/283c2431..0d2d6c34 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=01-02 Stats: 11 lines in 1 file changed: 1 ins; 1 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23421.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23421/head:pull/23421 PR: https://git.openjdk.org/jdk/pull/23421 From fbredberg at openjdk.org Wed Mar 5 12:31:24 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:31:24 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <09Lu69Do9amzXyGok3KDuP2whACShrPwRM7BOel5wgg=.ceed3ba0-9f91-4e95-9cf5-0e85362e29df@github.com> Message-ID: On Wed, 5 Mar 2025 05:14:54 GMT, David Holmes wrote: >> You're quite right. I'll rewrite that section of the comment. Thank you for spotting this. > > Yep my bad - you can't delete yourself without a prev node pointer when you are being pointed to. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981307190 From fbredberg at openjdk.org Wed Mar 5 12:34:56 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:34:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Mon, 3 Mar 2025 23:10:29 GMT, Patricio Chilano Mateo wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 1265: > >> 1263: // that updated _entry_list, so we can access w->_next. >> 1264: w = Atomic::load_acquire(&_entry_list); >> 1265: assert(w != nullptr, "invariant"); > > Maybe add the same assert as below for the single element case: `assert(w->TState == ObjectWaiter::TS_ENTER, "invariant")`. Since this is not strictly necessary, I will look into this in a follow up PR. > src/hotspot/share/runtime/objectMonitor.cpp line 1532: > >> 1530: // Let's say T1 then stalls. T2 acquires O and calls O.notify(). The >> 1531: // notify() operation moves T1 from O's waitset to O's entry_list. T2 then >> 1532: // release the lock "O". T2 resumes immediately after the ST of null into > > Pre-existent, but this should be T1. Same in next sentence. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981314184 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981313487 From fbredberg at openjdk.org Wed Mar 5 12:34:57 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:34:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: On Tue, 4 Mar 2025 18:08:15 GMT, Patricio Chilano Mateo wrote: >> We don't have a prev node, we don't know which node to set next to our next node to. The list will be broken. > > Right, we still have to set the previous links for those nodes. 
I'm just suggesting we don't have to walk the whole list, just until the last node we set the previous pointer. Since this is not strictly necessary, I will look into this in a follow up PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981312971 From yzheng at openjdk.org Wed Mar 5 12:37:52 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Wed, 5 Mar 2025 12:37:52 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. Overall looks good to me src/hotspot/share/ci/ciInstanceKlass.cpp line 481: > 479: // Now sort them by offset, ascending. > 480: // (In principle, they could mix with superclass fields.) > 481: fields->sort(sort_field_by_offset); This has no effect now, i.e., the fields were sorted already? ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/23849#pullrequestreview-2660958414 PR Review Comment: https://git.openjdk.org/jdk/pull/23849#discussion_r1981305860 From fbredberg at openjdk.org Wed Mar 5 12:43:03 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:43:03 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> Message-ID: <2pmoWBdeasqGUxjDKvJMIBUqgipo33xTNrYIdB6U1vM=.79067439-b3c1-4b47-8669-7e4d77a22b3f@github.com> On Mon, 3 Mar 2025 23:12:05 GMT, Patricio Chilano Mateo wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 1509: > >> 1507: // is no successor, so it appears that an heir-presumptive >> 1508: // (successor) must be made ready. Only the current lock owner can >> 1509: // detach threads from the entry_list, therefore we need to > > We don't detach threads here, so maybe manipulate would be better. Maybe, but manipulate may also include "pushing to the head", which is fine to do without holding the lock. I'll keep the comment as is for now, maybe this sentence will be deleted if we find a way of running exit without holding the lock, as we have talked about. If that's not possible I'll rephrase this sentence in a follow up PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981325861 From fbredberg at openjdk.org Wed Mar 5 12:51:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 12:51:02 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: <6TKNnpUGSCflszCRIY531Nnf1kMxjlYQm3V4Yf44riY=.5c5f69ef-cf63-4748-902b-39c2898762ee@github.com> On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. 
>> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. @mur47x111 I'm getting ready to integrate. I've seen that you have created [[JDK-8349711] Adapt JDK-8343840: Rewrite the ObjectMonitor lists](https://github.com/oracle/graal/pull/10757) to handle the change on your side. Do you see any reason why I shouldn't integrate, or are you fine with me integrating this PR now? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2700837592 From duke at openjdk.org Wed Mar 5 13:10:34 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 5 Mar 2025 13:10:34 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Added alignment to loop entries. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/331f1ecb..3aaa106f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=02-03 Stats: 9 lines in 2 files changed: 9 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Wed Mar 5 13:10:35 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 5 Mar 2025 13:10:35 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: On Wed, 5 Mar 2025 11:39:05 GMT, Jatin Bhateja wrote: >> Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: >> >> - Merged master. >> - Added comments, removed debugging printfs >> - JDK-8351034 Add AVX-512 intrinsics for ML-DSA > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 292: > >> 290: __ movl(iterations, 2); >> 291: >> 292: __ BIND(L_loop); > > Hi @ferakocz , Kindly align loop entry address using __align64() here and at all the places before __BIND(LOOP) Hi, @jatin-bhateja, thanks for the suggestion. I have added __ align(OptoLoopAlignment); before all loop entries. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1981364481 From dnsimon at openjdk.org Wed Mar 5 13:50:53 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 5 Mar 2025 13:50:53 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:26:12 GMT, Yudi Zheng wrote: >> The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. >> >> It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. > > src/hotspot/share/ci/ciInstanceKlass.cpp line 481: > >> 479: // Now sort them by offset, ascending. >> 480: // (In principle, they could mix with superclass fields.) >> 481: fields->sort(sort_field_by_offset); > > This has no effect now, i.e., the fields were sorted already? They now have whatever sort order is given by JavaFieldStream. This happens to currently be class file declaration order but it doesn't really matter if it changes. The only requirement is that the same order is used by `get_reassigned_fields` in `deoptimization.cpp`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23849#discussion_r1981441818 From jbhateja at openjdk.org Wed Mar 5 14:05:53 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 5 Mar 2025 14:05:53 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: On Wed, 5 Mar 2025 13:07:54 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 292: >> >>> 290: __ movl(iterations, 2); >>> 291: >>> 292: __ BIND(L_loop); >> >> Hi @ferakocz , Kindly align loop entry address using __align64() here and at all the places before __BIND(LOOP) > > Hi, @jatin-bhateja, thanks for the suggestion. I have added __ align(OptoLoopAlignment); before all loop entries. Hi @ferakocz , Thanks!, for efficient utilization of Decode ICache (please refer to Intel SDM section 3.4.2.5), code blocks should be aligned to 32-byte boundaries; a 64-byte aligned code is a superset of both 16 and 32 byte aligned addresses and also matches with the cacheline size. However, I can noticed that we have been using OptoLoopAlignment at places in AES-GCM also. I introduced some errors in generate_dilithiumAlmostInverseNtt_avx512 implementation in anticipation of catching it through existing ML_DSA_Tests under test/jdk/sun/security/provider/acvp But all the tests passed for me. `java -jar /home/jatinbha/sandboxes/jtreg/build/images/jtreg/lib/jtreg.jar -jdk:$JAVA_HOME -Djdk.test.lib.artifacts.ACVP-Server=/home/jatinbha/softwares/v1.1.0.38.zip -va -timeout:4 Launcher.java` Can you please point out a test I need to use for validation ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1981468903 From coleenp at openjdk.org Wed Mar 5 14:35:58 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 5 Mar 2025 14:35:58 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. 
>> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. Marked as reviewed by coleenp (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2661313357 From yzheng at openjdk.org Wed Mar 5 14:49:57 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Wed, 5 Mar 2025 14:49:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. 
>> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. JVMCI changes look go to me! We are good to go! ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2661358578 From pchilanomate at openjdk.org Wed Mar 5 14:52:57 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Wed, 5 Mar 2025 14:52:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. 
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. Thanks, looks good. ------------- Marked as reviewed by pchilanomate (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2661362873 From pchilanomate at openjdk.org Wed Mar 5 14:52:59 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Wed, 5 Mar 2025 14:52:59 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <2pmoWBdeasqGUxjDKvJMIBUqgipo33xTNrYIdB6U1vM=.79067439-b3c1-4b47-8669-7e4d77a22b3f@github.com> References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <2pmoWBdeasqGUxjDKvJMIBUqgipo33xTNrYIdB6U1vM=.79067439-b3c1-4b47-8669-7e4d77a22b3f@github.com> Message-ID: On Wed, 5 Mar 2025 12:40:41 GMT, Fredrik Bredberg wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 1509: >> >>> 1507: // is no successor, so it appears that an heir-presumptive >>> 1508: // (successor) must be made ready. Only the current lock owner can >>> 1509: // detach threads from the entry_list, therefore we need to >> >> We don't detach threads here, so maybe manipulate would be better. > > Maybe, but manipulate may also include "pushing to the head", which is fine to do without holding the lock. > I'll keep the comment as is for now, maybe this sentence will be deleted if we find a way of running exit without holding the lock, as we have talked about. If that's not possible I'll rephrase this sentence in a follow up PR. You could use the same wording we have in the comment above already just to make it consistent: `manipulate the _entry_list (except for pushing new threads to the head)`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981553339 From fbredberg at openjdk.org Wed Mar 5 14:56:57 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 5 Mar 2025 14:56:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <4POZFfUl_AWAh3K2rV3Uqey0xkYHApoZDjfuw3TVBlA=.4cf1547b-5279-40b4-bef4-4c9775ec1ad8@github.com> <2pmoWBdeasqGUxjDKvJMIBUqgipo33xTNrYIdB6U1vM=.79067439-b3c1-4b47-8669-7e4d77a22b3f@github.com> Message-ID: On Wed, 5 Mar 2025 14:49:01 GMT, Patricio Chilano Mateo wrote: >> Maybe, but manipulate may also include "pushing to the head", which is fine to do without holding the lock. >> I'll keep the comment as is for now, maybe this sentence will be deleted if we find a way of running exit without holding the lock, as we have talked about. If that's not possible I'll rephrase this sentence in a follow up PR. > > You could use the same wording we have in the comment above already just to make it consistent: `manipulate the _entry_list (except for pushing new threads to the head)`. I promise I'll fix that in the follow up PR. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1981563527 From duke at openjdk.org Wed Mar 5 18:30:03 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 5 Mar 2025 18:30:03 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: On Wed, 5 Mar 2025 14:03:00 GMT, Jatin Bhateja wrote: >> Hi, @jatin-bhateja, thanks for the suggestion. I have added __ align(OptoLoopAlignment); before all loop entries. > > Hi @ferakocz , > > Thanks!, for efficient utilization of Decode ICache (please refer to Intel SDM section 3.4.2.5), code blocks should be aligned to 32-byte boundaries; a 64-byte aligned code is a superset of both 16 and 32 byte aligned addresses and also matches with the cacheline size. However, I can noticed that we have been using OptoLoopAlignment at places in AES-GCM also. > > I introduced some errors in generate_dilithiumAlmostInverseNtt_avx512 implementation in anticipation of catching it through existing ML_DSA_Tests under > test/jdk/sun/security/provider/acvp > > But all the tests passed for me. > `java -jar /home/jatinbha/sandboxes/jtreg/build/images/jtreg/lib/jtreg.jar -jdk:$JAVA_HOME -Djdk.test.lib.artifacts.ACVP-Server=/home/jatinbha/softwares/v1.1.0.38.zip -va -timeout:4 Launcher.java` > > Can you please point out a test I need to use for validation I think the easiest is to put a for (int i = 0; i < 1000; i++) loop around the switch statement in the run() method of the ML_DSA_Test class (test/jdk/sun/security/provider/acvp/ML_DSA_Test.java). (This is because the intrinsics kick in after a few thousand calls of the method.) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1981945490 From dholmes at openjdk.org Thu Mar 6 04:26:57 2025 From: dholmes at openjdk.org (David Holmes) Date: Thu, 6 Mar 2025 04:26:57 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. 
>> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. LGTM! Thanks ------------- Marked as reviewed by dholmes (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2663279336 From fbredberg at openjdk.org Thu Mar 6 09:11:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 6 Mar 2025 09:11:02 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v3] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 12:31:23 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. 
>> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Updated comments after review by Patricio. Thanks everyone for the reviews, testing and Graal adaptation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2703235851 From fbredberg at openjdk.org Thu Mar 6 09:11:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 6 Mar 2025 09:11:02 GMT Subject: Integrated: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: <8EoKGr_0E4MpBGmwoWS8At5wyy2Q44zgxI8eWi4A-AA=.46f9908d-a525-4ca5-9472-34cfd37de0d3@github.com> On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. 
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. > > However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... This pull request has now been integrated. Changeset: 7a5acb9b Author: Fredrik Bredberg URL: https://git.openjdk.org/jdk/commit/7a5acb9be17cd54bbd0abf2524386b981dd5ac04 Stats: 614 lines in 10 files changed: 214 ins; 228 del; 172 mod 8343840: Rewrite the ObjectMonitor lists Reviewed-by: dholmes, coleenp, pchilanomate, yzheng ------------- PR: https://git.openjdk.org/jdk/pull/23421 From jbhateja at openjdk.org Thu Mar 6 09:34:53 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 6 Mar 2025 09:34:53 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: On Wed, 5 Mar 2025 18:27:44 GMT, Ferenc Rakoczi wrote: >> Hi @ferakocz , >> >> Thanks!, for efficient utilization of Decode ICache (please refer to Intel SDM section 3.4.2.5), code blocks should be aligned to 32-byte boundaries; a 64-byte aligned code is a superset of both 16 and 32 byte aligned addresses and also matches with the cacheline size. However, I can noticed that we have been using OptoLoopAlignment at places in AES-GCM also. >> >> I introduced some errors in generate_dilithiumAlmostInverseNtt_avx512 implementation in anticipation of catching it through existing ML_DSA_Tests under >> test/jdk/sun/security/provider/acvp >> >> But all the tests passed for me. >> `java -jar /home/jatinbha/sandboxes/jtreg/build/images/jtreg/lib/jtreg.jar -jdk:$JAVA_HOME -Djdk.test.lib.artifacts.ACVP-Server=/home/jatinbha/softwares/v1.1.0.38.zip -va -timeout:4 Launcher.java` >> >> Can you please point out a test I need to use for validation > > I think the easiest is to put a for (int i = 0; i < 1000; i++) loop around the switch statement in the run() method of the ML_DSA_Test class (test/jdk/sun/security/provider/acvp/ML_DSA_Test.java). (This is because the intrinsics kick in after a few thousand calls of the method.) Hi @ferakocz , Yes, we should modify the test or lower the compilation threshold with -Xbatch -XX:TieredCompileThreshold=0.1. Alternatively, since the tests has a depedency on Automatic Cryptographic Validation Test server I have created a simplified test which cover all the security levels. Kindly include [test/hotspot/jtreg/compiler/intrinsics/signature/TestModuleLatticeDSA.java ](https://github.com/ferakocz/jdk/pull/1) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983009390 From jbhateja at openjdk.org Thu Mar 6 09:34:55 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 6 Mar 2025 09:34:55 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 13:10:34 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
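For context, a minimal Java sketch (not part of this patch) that exercises the three accelerated operations through the standard JCA API. It assumes a JDK that ships an ML-DSA provider (JDK 24 or later); the class name and iteration count are made up, and the loop only reflects the point made above that the intrinsics kick in once the methods have been called a few thousand times.

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class MlDsaSmokeTest {
    public static void main(String[] args) throws Exception {
        byte[] message = "hello".getBytes();

        KeyPairGenerator kpg = KeyPairGenerator.getInstance("ML-DSA");
        KeyPair kp = kpg.generateKeyPair();              // key generation

        for (int i = 0; i < 10_000; i++) {               // enough calls for C2 compilation + intrinsics
            Signature signer = Signature.getInstance("ML-DSA");
            signer.initSign(kp.getPrivate());
            signer.update(message);
            byte[] sig = signer.sign();                  // document signing

            Signature verifier = Signature.getInstance("ML-DSA");
            verifier.initVerify(kp.getPublic());
            verifier.update(message);
            if (!verifier.verify(sig)) {                 // signature verification
                throw new AssertionError("verification failed");
            }
        }
    }
}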
> > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added alignment to loop entries. src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 85: > 83: if (UseSHA3Intrinsics) { > 84: StubRoutines::_sha3_implCompress = generate_sha3_implCompress(StubGenStubId::sha3_implCompress_id); > 85: StubRoutines::_double_keccak = generate_double_keccak(); Should UseDilithiumIntrinsics guard double_keccak generation ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1982922845 From duke at openjdk.org Thu Mar 6 09:49:12 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 6 Mar 2025 09:49:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: On Thu, 6 Mar 2025 08:37:57 GMT, Jatin Bhateja wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Added alignment to loop entries. > > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 85: > >> 83: if (UseSHA3Intrinsics) { >> 84: StubRoutines::_sha3_implCompress = generate_sha3_implCompress(StubGenStubId::sha3_implCompress_id); >> 85: StubRoutines::_double_keccak = generate_double_keccak(); > > Should UseDilithiumIntrinsics guard double_keccak generation ? No, that is more of a SHA3 thing, other algorithms can take advantage of it, too (e.g. ML-KEM). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983033331 From galder at openjdk.org Thu Mar 6 14:06:35 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 6 Mar 2025 14:06:35 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v13] In-Reply-To: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: > This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. > > Currently vectorization does not kick in for loops containing either of these calls because of the following error: > > > VLoop::check_preconditions: failed: control flow in loop not allowed > > > The control flow is due to the java implementation for these methods, e.g. > > > public static long max(long a, long b) { > return (a >= b) ? a : b; > } > > > This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. > By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. > E.g. > > > SuperWord::transform_loop: > Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined > 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) > > > Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. 
Before the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1155 > long max 1173 > > > After the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1042 > long max 1042 > > > This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. > Therefore, it still relies on the macro expansion to transform those into CMoveL. > > I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier1 2500 2500 0 0 >>> jtreg:test/jdk:tier1 ... Galder Zamarre?o has updated the pull request incrementally with one additional commit since the last revision: Add simple reduction benchmarks on top of multiply ones ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20098/files - new: https://git.openjdk.org/jdk/pull/20098/files/a190ae68..d0e793a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=11-12 Stats: 44 lines in 1 file changed: 40 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/20098.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20098/head:pull/20098 PR: https://git.openjdk.org/jdk/pull/20098 From epeter at openjdk.org Thu Mar 6 15:07:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 6 Mar 2025 15:07:07 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Thu, 27 Feb 2025 16:38:30 GMT, Galder Zamarre?o wrote: >> Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: >> >> - Merge branch 'master' into topic.intrinsify-max-min-long >> - Fix typo >> - Renaming methods and variables and add docu on algorithms >> - Fix copyright years >> - Make sure it runs with cpus with either avx512 or asimd >> - Test can only run with 256 bit registers or bigger >> >> * Remove platform dependant check >> and use platform independent configuration instead. >> - Fix license header >> - Tests should also run on aarch64 asimd=true envs >> - Added comment around the assertions >> - Adjust min/max identity IR test expectations after changes >> - ... and 34 more: https://git.openjdk.org/jdk/compare/47fdb836...a190ae68 > > Also, I've started a [discussion on jmh-dev](https://mail.openjdk.org/pipermail/jmh-dev/2025-February/004094.html) to see if there's a way to minimise pollution of `Math.min(II)` compilation. As a follow to https://github.com/openjdk/jdk/pull/20098#issuecomment-2684701935 I looked at where the other `Math.min(II)` calls are coming from, and a big chunk seem related to the JMH infrastructure. 
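As an aside, a minimal JMH-style sketch (hypothetical, not one of the benchmarks used in this PR) of the reduction shape the intrinsic targets: once Math.max(long, long) is intrinsified to a branchless MaxL node, the loop body below contains no control flow and becomes a SuperWord candidate.

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class LongMaxReductionSketch {
    long[] values = new long[1024];

    @Setup
    public void fill() {
        for (int i = 0; i < values.length; i++) {
            values[i] = (i * 31L) % 7919;    // arbitrary data
        }
    }

    @Benchmark
    public long maxReduction() {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < values.length; i++) {
            max = Math.max(max, values[i]);  // intrinsified: no CmpL/Bool in the loop body
        }
        return max;
    }
}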
@galderz about: > Additional performance improvement: make SuperWord recognize more cases as profitble (see Regression 1). Optional. This should already be covered by these, and I will handle that eventually with the Cost-Model RFE [JDK-8340093](https://bugs.openjdk.org/browse/JDK-8340093): - [JDK-8345044](https://bugs.openjdk.org/browse/JDK-8345044) Sum of array elements not vectorized - (min/max of array) - [JDK-8336000](https://bugs.openjdk.org/browse/JDK-8336000) C2 SuperWord: report that 2-element reductions do not vectorize - You would for example see that on aarch64 machines with only neon/asimd support you can have at most 2 longs per vector, because the max vector length is 128 bits. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2704110051 From epeter at openjdk.org Thu Mar 6 15:26:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 6 Mar 2025 15:26:09 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Thu, 27 Feb 2025 16:38:30 GMT, Galder Zamarre?o wrote: >> Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: >> >> - Merge branch 'master' into topic.intrinsify-max-min-long >> - Fix typo >> - Renaming methods and variables and add docu on algorithms >> - Fix copyright years >> - Make sure it runs with cpus with either avx512 or asimd >> - Test can only run with 256 bit registers or bigger >> >> * Remove platform dependant check >> and use platform independent configuration instead. >> - Fix license header >> - Tests should also run on aarch64 asimd=true envs >> - Added comment around the assertions >> - Adjust min/max identity IR test expectations after changes >> - ... and 34 more: https://git.openjdk.org/jdk/compare/dfbb2ee6...a190ae68 > > Also, I've started a [discussion on jmh-dev](https://mail.openjdk.org/pipermail/jmh-dev/2025-February/004094.html) to see if there's a way to minimise pollution of `Math.min(II)` compilation. As a follow to https://github.com/openjdk/jdk/pull/20098#issuecomment-2684701935 I looked at where the other `Math.min(II)` calls are coming from, and a big chunk seem related to the JMH infrastructure. @galderz about: > Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional. I looked at `src/hotspot/cpu/x86/x86.ad` bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType bt) { 1774 case Op_MaxV: 1775 case Op_MinV: 1776 if (UseSSE < 4 && is_integral_type(bt)) { 1777 return false; 1778 } ... So it seems that here lanewise min/max are supported for AVX2. But it seems that's different for reductions: 1818 case Op_MinReductionV: 1819 case Op_MaxReductionV: 1820 if ((bt == T_INT || is_subword_type(bt)) && UseSSE < 4) { 1821 return false; 1822 } else if (bt == T_LONG && (UseAVX < 3 || !VM_Version::supports_avx512vlbwdq())) { 1823 return false; 1824 } ... So it seems maybe we could improve the AVX2 coverage for reductions. But honestly, I will probably find this issue again once I work on the other reductions above, and run the benchmarks. I think that will make it easier to investigate all of this. 
I will for example adjust the IR rules, and then it will be apparent where there are cases that are not covered. @galderz you said you would add some extra comments, then I will review again :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2704159992 PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2704161929 From tschatzl at openjdk.org Thu Mar 6 15:39:57 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 6 Mar 2025 15:39:57 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v11] In-Reply-To: References: Message-ID: <4um7PHAs89PIoa3QgbkPx-8Jx9vHiYr7afFQGOtFTY8=.f1ca8bad-0827-4f8c-852d-0fc82ffd546a@github.com> On Tue, 4 Mar 2025 15:33:29 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming > > src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 219: > >> 217: // The young gen revising mechanism reads the predictor and the values set >> 218: // here. Avoid inconsistencies by locking. >> 219: MutexLocker x(G1RareEvent_lock, Mutex::_no_safepoint_check_flag); > > Who else can be in this critical-section? I don't get what this lock is protecting us from. Actually, further discussion with @albertnetymk showed that this change introduces an unintended behavioral change, because the refinement control thread is also responsible for updating the current young gen length. That means the mutex isn't required. However, it also means that this update is no longer done while refinement is running, and refinement can take seconds, so I need to move this work to another thread (probably the `G1ServiceThread`?). I will add a separate mutex then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983587293 From tschatzl at openjdk.org Thu Mar 6 16:13:02 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 6 Mar 2025 16:13:02 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v13] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 10:41:02 GMT, Ivan Walulya wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * fix whitespace >> * additional whitespace between log tags >> * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename > > src/hotspot/share/gc/g1/g1ThreadLocalData.hpp line 29: > >> 27: #include "gc/g1/g1BarrierSet.hpp" >> 28: #include "gc/g1/g1CardTable.hpp" >> 29: #include "gc/g1/g1CollectedHeap.hpp" > > probably does not need to be included `g1CardTable.hpp` is needed because of `G1CardTable::CardValue`, I think. I removed the `G1CollectedHeap` include though.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983655594 From tschatzl at openjdk.org Thu Mar 6 16:26:31 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 6 Mar 2025 16:26:31 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v14] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... 
Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * iwalulya review * renaming * fix some includes, forward declaration ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/a457e6e7..350a4fa3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=12-13 Stats: 31 lines in 13 files changed: 1 ins; 2 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From jbhateja at openjdk.org Thu Mar 6 16:38:57 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 6 Mar 2025 16:38:57 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: On Wed, 5 Mar 2025 13:10:34 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added alignment to loop entries. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 2: > 1: /* > 2: * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved. Please update copyright year src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 96: > 94: StubRoutines::_dilithiumMontMulByConstant = generate_dilithiumMontMulByConstant_avx512(); > 95: StubRoutines::_dilithiumDecomposePoly = generate_dilithiumDecomposePoly_avx512(); > 96: } Indentation fix needed src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 362: > 360: const Register roundsLeft = r11; > 361: > 362: __ align(OptoLoopAlignment); Redundant alignment before label should be before it's bind ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983463096 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983464620 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983477681 From duke at openjdk.org Thu Mar 6 17:37:33 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 6 Mar 2025 17:37:33 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Accepted review comments. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/3aaa106f..64135f29 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=03-04 Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From galder at openjdk.org Fri Mar 7 06:19:03 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 06:19:03 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> > This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. > > Currently vectorization does not kick in for loops containing either of these calls because of the following error: > > > VLoop::check_preconditions: failed: control flow in loop not allowed > > > The control flow is due to the java implementation for these methods, e.g. > > > public static long max(long a, long b) { > return (a >= b) ? a : b; > } > > > This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. > By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. > E.g. > > > SuperWord::transform_loop: > Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined > 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) > > > Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1155 > long max 1173 > > > After the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1042 > long max 1042 > > > This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. > Therefore, it still relies on the macro expansion to transform those into CMoveL. > > I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier1 2500 2500 0 0 >>> jtreg:test/jdk:tier1 ... Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 47 additional commits since the last revision: - Merge branch 'master' into topic.intrinsify-max-min-long - Add assertion comments - Add simple reduction benchmarks on top of multiply ones - Merge branch 'master' into topic.intrinsify-max-min-long - Fix typo - Renaming methods and variables and add docu on algorithms - Fix copyright years - Make sure it runs with cpus with either avx512 or asimd - Test can only run with 256 bit registers or bigger * Remove platform dependant check and use platform independent configuration instead. - Fix license header - ... and 37 more: https://git.openjdk.org/jdk/compare/a328e466...1aa690d3 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20098/files - new: https://git.openjdk.org/jdk/pull/20098/files/d0e793a3..1aa690d3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=12-13 Stats: 65249 lines in 2144 files changed: 33401 ins; 21691 del; 10157 mod Patch: https://git.openjdk.org/jdk/pull/20098.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20098/head:pull/20098 PR: https://git.openjdk.org/jdk/pull/20098 From galder at openjdk.org Fri Mar 7 06:19:04 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 06:19:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v4] In-Reply-To: <9ReqLUCZ6XDaSQxgYw3NyZZdMv3SOHkCkzJ0DLAksas=.8cb29982-8cb8-4068-a251-59a189c83b93@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9ReqLUCZ6XDaSQxgYw3NyZZdMv3SOHkCkzJ0DLAksas=.8cb29982-8cb8-4068-a251-59a189c83b93@github.com> Message-ID: On Tue, 17 Dec 2024 16:40:01 GMT, Galder Zamarre?o wrote: >> test/hotspot/jtreg/compiler/intrinsics/math/TestMinMaxInlining.java line 80: >> >>> 78: @IR(phase = { CompilePhase.BEFORE_MACRO_EXPANSION }, counts = { IRNode.MIN_L, "1" }) >>> 79: @IR(phase = { CompilePhase.AFTER_MACRO_EXPANSION }, counts = { IRNode.MIN_L, "0" }) >>> 80: private static long testLongMin(long a, long b) { >> >> Can you add a comment why it disappears after macro expansion? > > ~Good question. On non-avx512 machines after macro expansion the min/max nodes become cmov nodes, but but that's not the full story because on avx512 machines, they become minV/maxV nodes. Would you tweak the `@IR` annotations to capture this? Or would you leave it just as a comment?~ > > Scratch that, this is not a test for arrays, so no minV/maxV nodes. I'll just add a comment. I've added a comment ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20098#discussion_r1984510490 From galder at openjdk.org Fri Mar 7 06:19:04 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 06:19:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Thu, 6 Mar 2025 15:22:18 GMT, Emanuel Peter wrote: >> Also, I've started a [discussion on jmh-dev](https://mail.openjdk.org/pipermail/jmh-dev/2025-February/004094.html) to see if there's a way to minimise pollution of `Math.min(II)` compilation. As a follow to https://github.com/openjdk/jdk/pull/20098#issuecomment-2684701935 I looked at where the other `Math.min(II)` calls are coming from, and a big chunk seem related to the JMH infrastructure. 
> > @galderz you said you would add some extra comments, then I will review again :) @eme64 I've added the comment that was pending from your last review. I've also merged latest master. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2705620662 From epeter at openjdk.org Fri Mar 7 06:48:05 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Mar 2025 06:48:05 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:19:03 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 47 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Add assertion comments > - Add simple reduction benchmarks on top of multiply ones > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - ... and 37 more: https://git.openjdk.org/jdk/compare/99572e4c...1aa690d3 Looks good, thanks for all the updates :) I'm launching another round of testing on our side ;) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20098#pullrequestreview-2666394529 PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2705659841 From galder at openjdk.org Fri Mar 7 09:23:06 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 09:23:06 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:44:57 GMT, Emanuel Peter wrote: >> Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 47 additional commits since the last revision: >> >> - Merge branch 'master' into topic.intrinsify-max-min-long >> - Add assertion comments >> - Add simple reduction benchmarks on top of multiply ones >> - Merge branch 'master' into topic.intrinsify-max-min-long >> - Fix typo >> - Renaming methods and variables and add docu on algorithms >> - Fix copyright years >> - Make sure it runs with cpus with either avx512 or asimd >> - Test can only run with 256 bit registers or bigger >> >> * Remove platform dependant check >> and use platform independent configuration instead. >> - Fix license header >> - ... and 37 more: https://git.openjdk.org/jdk/compare/bc67ede6...1aa690d3 > > I'm launching another round of testing on our side ;) @eme64 I've run tier[1-3] locally and looked good overall. I had to update jtreg and noticed this failure but I don't think it's related to this PR: java.lang.AssertionError: gtest execution failed; exit code = 2. 
the failed tests: [codestrings::validate_vm] at GTestWrapper.main(GTestWrapper.java:98) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) at java.base/java.lang.Thread.run(Thread.java:1447) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2705937075 From galder at openjdk.org Fri Mar 7 12:28:58 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Mar 2025 12:28:58 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> Message-ID: On Thu, 27 Feb 2025 06:54:30 GMT, Emanuel Peter wrote: > As for possible solutions. In all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFE's: > > * Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue. I've created [JDK-8351409](https://bugs.openjdk.org/browse/JDK-8351409) to address this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2706324225 From ayang at openjdk.org Fri Mar 7 13:16:59 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Fri, 7 Mar 2025 13:16:59 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v14] In-Reply-To: References: Message-ID: <5w6qUwzDQadxseocRl6rRF0AllyeukWTpYl2XjAfiTE=.fb62a50e-e308-4d08-8057-67e70e13ccbb@github.com> On Thu, 6 Mar 2025 16:26:31 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
>> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * iwalulya review > * renaming > * fix some includes, forward declaration src/hotspot/share/gc/g1/g1CardTable.hpp line 76: > 74: g1_card_already_scanned = 0x1, > 75: g1_to_cset_card = 0x2, > 76: g1_from_remset_card = 0x4 Could you outline the motivation for this more precise info? Is it for optimization or essentially for correctness? src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 54: > 52: assert(refinement_r == card_r, "not same region source %u (%zu) dest %u (%zu) ", refinement_r->hrm_index(), refinement_i, card_r->hrm_index(), card_i); > 53: assert(refinement_i == card_i, "indexes are not same %zu %zu", refinement_i, card_i); > 54: #endif I feel this assert logic can be extracted to a method, sth like `verify_card_pair`. src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 64: > 62: report_inactive("Paused"); > 63: sts_join.yield(); > 64: // Reset after yield rather than accumulating across yields, else a The comment seems obsolete after the removal of stats. src/hotspot/share/gc/g1/g1OopClosures.inline.hpp line 158: > 156: if (_has_ref_to_cset) { > 157: return; > 158: } Is it really necessary to write `false` to `_has_ref_to_cset`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1985041202 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983846649 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983842440 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1983857348 From epeter at openjdk.org Fri Mar 7 13:19:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Mar 2025 13:19:59 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> Message-ID: On Fri, 7 Mar 2025 12:25:51 GMT, Galder Zamarre?o wrote: >> @galderz Thanks for the summary of regressions! 
Yes, there are plenty of speedups, I assume primarily because of `Long.min/max` vectorization, but possibly also because the operation can now "float" out of a loop for example. >> >> All your Regressions 1-3 are cases with "extreme" probability (close to 100% / 0%), and you listed no others. That matches my intuition that branching code is usually better than cmove in extreme probability cases. >> >> As for possible solutions. In all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFE's: >> - Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue. >> - Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional. >> - Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional. >> >> Does that make sense, or am I missing something? > >> As for possible solutions. In all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFE's: >> >> * Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue. > > I've created [JDK-8351409](https://bugs.openjdk.org/browse/JDK-8351409) to address this. @galderz Excellent. Testing looks all good on our side. Yes, I think what you saw was unrelated. @rwestrel Could you give this a last quick scan, and then I think you can integrate :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2706434983 From duke at openjdk.org Fri Mar 7 14:22:29 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 7 Mar 2025 14:22:29 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code Message-ID: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> `Math.*Exact` intrinsics can cause many deopts when used repeatedly with problematic arguments. This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. Benchmarks show that this issue affects every Math.*Exact function, and this fix improves them all. tl;dr: - C1: no problem, no change - C2: - with intrinsics: - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) - without overflow: no problem, no change - without intrinsics: no problem, no change Before the fix: Benchmark (SIZE) Mode Cnt Score Error Units MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ?
0.810 ms/op MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.590 ms/op MathExact.C1_1.loopNegateLOverflow 1000000 avgt 3 638.837 ? 49.512 ms/op MathExact.C1_1.loopSubtractIInBounds 1000000 avgt 3 1.255 ? 0.799 ms/op MathExact.C1_1.loopSubtractIOverflow 1000000 avgt 3 637.857 ? 231.804 ms/op MathExact.C1_1.loopSubtractLInBounds 1000000 avgt 3 1.412 ? 0.602 ms/op MathExact.C1_1.loopSubtractLOverflow 1000000 avgt 3 642.113 ? 251.349 ms/op MathExact.C1_2.loopAddIInBounds 1000000 avgt 3 1.748 ? 1.095 ms/op MathExact.C1_2.loopAddIOverflow 1000000 avgt 3 654.617 ? 287.678 ms/op MathExact.C1_2.loopAddLInBounds 1000000 avgt 3 2.004 ? 1.655 ms/op MathExact.C1_2.loopAddLOverflow 1000000 avgt 3 670.791 ? 93.689 ms/op MathExact.C1_2.loopDecrementIInBounds 1000000 avgt 3 5.306 ? 65.215 ms/op MathExact.C1_2.loopDecrementIOverflow 1000000 avgt 3 650.425 ? 461.740 ms/op MathExact.C1_2.loopDecrementLInBounds 1000000 avgt 3 5.484 ? 42.778 ms/op MathExact.C1_2.loopDecrementLOverflow 1000000 avgt 3 656.747 ? 333.281 ms/op MathExact.C1_2.loopIncrementIInBounds 1000000 avgt 3 3.077 ? 1.677 ms/op MathExact.C1_2.loopIncrementIOverflow 1000000 avgt 3 634.510 ? 51.365 ms/op MathExact.C1_2.loopIncrementLInBounds 1000000 avgt 3 3.902 ? 18.471 ms/op MathExact.C1_2.loopIncrementLOverflow 1000000 avgt 3 656.465 ? 227.014 ms/op MathExact.C1_2.loopMultiplyIInBounds 1000000 avgt 3 2.384 ? 10.045 ms/op MathExact.C1_2.loopMultiplyIOverflow 1000000 avgt 3 624.029 ? 342.084 ms/op MathExact.C1_2.loopMultiplyLInBounds 1000000 avgt 3 3.247 ? 0.735 ms/op MathExact.C1_2.loopMultiplyLOverflow 1000000 avgt 3 661.427 ? 100.744 ms/op MathExact.C1_2.loopNegateIInBounds 1000000 avgt 3 3.061 ? 1.148 ms/op MathExact.C1_2.loopNegateIOverflow 1000000 avgt 3 645.241 ? 323.824 ms/op MathExact.C1_2.loopNegateLInBounds 1000000 avgt 3 3.211 ? 0.068 ms/op MathExact.C1_2.loopNegateLOverflow 1000000 avgt 3 658.846 ? 204.524 ms/op MathExact.C1_2.loopSubtractIInBounds 1000000 avgt 3 1.717 ? 0.161 ms/op MathExact.C1_2.loopSubtractIOverflow 1000000 avgt 3 644.287 ? 301.787 ms/op MathExact.C1_2.loopSubtractLInBounds 1000000 avgt 3 3.976 ? 11.982 ms/op MathExact.C1_2.loopSubtractLOverflow 1000000 avgt 3 660.871 ? 16.538 ms/op MathExact.C1_3.loopAddIInBounds 1000000 avgt 3 4.380 ? 42.598 ms/op MathExact.C1_3.loopAddIOverflow 1000000 avgt 3 686.766 ? 511.146 ms/op MathExact.C1_3.loopAddLInBounds 1000000 avgt 3 5.445 ? 49.738 ms/op MathExact.C1_3.loopAddLOverflow 1000000 avgt 3 641.936 ? 32.769 ms/op MathExact.C1_3.loopDecrementIInBounds 1000000 avgt 3 8.340 ? 69.455 ms/op MathExact.C1_3.loopDecrementIOverflow 1000000 avgt 3 682.239 ? 212.017 ms/op MathExact.C1_3.loopDecrementLInBounds 1000000 avgt 3 6.048 ? 
0.651 ms/op MathExact.C1_3.loopDecrementLOverflow 1000000 avgt 3 670.924 ? 42.037 ms/op MathExact.C1_3.loopIncrementIInBounds 1000000 avgt 3 7.970 ? 63.664 ms/op MathExact.C1_3.loopIncrementIOverflow 1000000 avgt 3 684.490 ? 197.407 ms/op MathExact.C1_3.loopIncrementLInBounds 1000000 avgt 3 8.780 ? 86.737 ms/op MathExact.C1_3.loopIncrementLOverflow 1000000 avgt 3 660.941 ? 172.305 ms/op MathExact.C1_3.loopMultiplyIInBounds 1000000 avgt 3 3.241 ? 0.567 ms/op MathExact.C1_3.loopMultiplyIOverflow 1000000 avgt 3 630.455 ? 138.458 ms/op MathExact.C1_3.loopMultiplyLInBounds 1000000 avgt 3 5.906 ? 0.662 ms/op MathExact.C1_3.loopMultiplyLOverflow 1000000 avgt 3 693.248 ? 539.146 ms/op MathExact.C1_3.loopNegateIInBounds 1000000 avgt 3 6.394 ? 7.757 ms/op MathExact.C1_3.loopNegateIOverflow 1000000 avgt 3 644.722 ? 56.929 ms/op MathExact.C1_3.loopNegateLInBounds 1000000 avgt 3 7.610 ? 41.533 ms/op MathExact.C1_3.loopNegateLOverflow 1000000 avgt 3 670.166 ? 14.496 ms/op MathExact.C1_3.loopSubtractIInBounds 1000000 avgt 3 3.345 ? 1.977 ms/op MathExact.C1_3.loopSubtractIOverflow 1000000 avgt 3 677.317 ? 22.878 ms/op MathExact.C1_3.loopSubtractLInBounds 1000000 avgt 3 3.226 ? 0.122 ms/op MathExact.C1_3.loopSubtractLOverflow 1000000 avgt 3 643.642 ? 65.217 ms/op MathExact.C2.loopAddIInBounds 1000000 avgt 3 1.217 ? 1.694 ms/op MathExact.C2.loopAddIOverflow 1000000 avgt 3 3995.424 ? 1177.165 ms/op MathExact.C2.loopAddLInBounds 1000000 avgt 3 2.404 ? 0.053 ms/op MathExact.C2.loopAddLOverflow 1000000 avgt 3 3997.984 ? 612.558 ms/op MathExact.C2.loopDecrementIInBounds 1000000 avgt 3 2.014 ? 0.176 ms/op MathExact.C2.loopDecrementIOverflow 1000000 avgt 3 3828.615 ? 260.670 ms/op MathExact.C2.loopDecrementLInBounds 1000000 avgt 3 1.986 ? 1.536 ms/op MathExact.C2.loopDecrementLOverflow 1000000 avgt 3 4075.934 ? 263.798 ms/op MathExact.C2.loopIncrementIInBounds 1000000 avgt 3 2.238 ? 6.380 ms/op MathExact.C2.loopIncrementIOverflow 1000000 avgt 3 3927.929 ? 837.162 ms/op MathExact.C2.loopIncrementLInBounds 1000000 avgt 3 1.971 ? 1.232 ms/op MathExact.C2.loopIncrementLOverflow 1000000 avgt 3 3915.202 ? 1024.956 ms/op MathExact.C2.loopMultiplyIInBounds 1000000 avgt 3 1.175 ? 0.509 ms/op MathExact.C2.loopMultiplyIOverflow 1000000 avgt 3 3803.719 ? 1583.828 ms/op MathExact.C2.loopMultiplyLInBounds 1000000 avgt 3 0.937 ? 0.631 ms/op MathExact.C2.loopMultiplyLOverflow 1000000 avgt 3 4023.742 ? 967.498 ms/op MathExact.C2.loopNegateIInBounds 1000000 avgt 3 2.129 ? 1.094 ms/op MathExact.C2.loopNegateIOverflow 1000000 avgt 3 3850.484 ? 464.979 ms/op MathExact.C2.loopNegateLInBounds 1000000 avgt 3 2.247 ? 9.714 ms/op MathExact.C2.loopNegateLOverflow 1000000 avgt 3 3911.853 ? 362.961 ms/op MathExact.C2.loopSubtractIInBounds 1000000 avgt 3 1.141 ? 1.579 ms/op MathExact.C2.loopSubtractIOverflow 1000000 avgt 3 3917.533 ? 628.485 ms/op MathExact.C2.loopSubtractLInBounds 1000000 avgt 3 2.232 ? 22.329 ms/op MathExact.C2.loopSubtractLOverflow 1000000 avgt 3 3995.088 ? 302.549 ms/op MathExact.C2_no_intrinsics.loopAddIInBounds 1000000 avgt 3 1.488 ? 12.243 ms/op MathExact.C2_no_intrinsics.loopAddIOverflow 1000000 avgt 3 585.568 ? 106.360 ms/op MathExact.C2_no_intrinsics.loopAddLInBounds 1000000 avgt 3 2.234 ? 23.010 ms/op MathExact.C2_no_intrinsics.loopAddLOverflow 1000000 avgt 3 602.290 ? 212.146 ms/op MathExact.C2_no_intrinsics.loopDecrementIInBounds 1000000 avgt 3 4.705 ? 36.814 ms/op MathExact.C2_no_intrinsics.loopDecrementIOverflow 1000000 avgt 3 590.212 ? 
280.334 ms/op
MathExact.C2_no_intrinsics.loopDecrementLInBounds 1000000 avgt 3 2.374 ? 13.667 ms/op
MathExact.C2_no_intrinsics.loopDecrementLOverflow 1000000 avgt 3 583.053 ? 50.535 ms/op
MathExact.C2_no_intrinsics.loopIncrementIInBounds 1000000 avgt 3 3.966 ? 15.366 ms/op
MathExact.C2_no_intrinsics.loopIncrementIOverflow 1000000 avgt 3 591.683 ? 171.580 ms/op
MathExact.C2_no_intrinsics.loopIncrementLInBounds 1000000 avgt 3 3.682 ? 23.147 ms/op
MathExact.C2_no_intrinsics.loopIncrementLOverflow 1000000 avgt 3 601.325 ? 10.597 ms/op
MathExact.C2_no_intrinsics.loopMultiplyIInBounds 1000000 avgt 3 1.307 ? 0.235 ms/op
MathExact.C2_no_intrinsics.loopMultiplyIOverflow 1000000 avgt 3 570.615 ? 50.808 ms/op
MathExact.C2_no_intrinsics.loopMultiplyLInBounds 1000000 avgt 3 1.087 ? 0.486 ms/op
MathExact.C2_no_intrinsics.loopMultiplyLOverflow 1000000 avgt 3 595.713 ? 162.773 ms/op
MathExact.C2_no_intrinsics.loopNegateIInBounds 1000000 avgt 3 1.874 ? 0.954 ms/op
MathExact.C2_no_intrinsics.loopNegateIOverflow 1000000 avgt 3 596.588 ? 68.081 ms/op
MathExact.C2_no_intrinsics.loopNegateLInBounds 1000000 avgt 3 2.337 ? 12.164 ms/op
MathExact.C2_no_intrinsics.loopNegateLOverflow 1000000 avgt 3 573.711 ? 63.243 ms/op
MathExact.C2_no_intrinsics.loopSubtractIInBounds 1000000 avgt 3 1.085 ? 0.815 ms/op
MathExact.C2_no_intrinsics.loopSubtractIOverflow 1000000 avgt 3 579.489 ? 61.399 ms/op
MathExact.C2_no_intrinsics.loopSubtractLInBounds 1000000 avgt 3 1.020 ? 0.161 ms/op
MathExact.C2_no_intrinsics.loopSubtractLOverflow 1000000 avgt 3 580.578 ? 167.454 ms/op

After:

Benchmark (SIZE) Mode Cnt Score Error Units
MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.369 ? 0.462 ms/op
MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 635.020 ? 106.156 ms/op
MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.371 ? 0.020 ms/op
MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 633.864 ? 72.176 ms/op
MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 2.053 ? 0.330 ms/op
MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 634.675 ? 79.427 ms/op
MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 3.798 ? 38.502 ms/op
MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 650.880 ? 123.220 ms/op
MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 2.305 ? 4.829 ms/op
MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 648.231 ? 39.012 ms/op
MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.627 ? 3.129 ms/op
MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 663.671 ? 446.140 ms/op
MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.479 ? 0.102 ms/op
MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 627.959 ? 297.291 ms/op
MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.718 ? 0.806 ms/op
MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.310 ? 112.686 ms/op
MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.079 ? 2.166 ms/op
MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 640.530 ? 152.489 ms/op
MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 3.168 ? 16.524 ms/op
MathExact.C1_1.loopNegateLOverflow 1000000 avgt 3 650.823 ? 58.420 ms/op
MathExact.C1_1.loopSubtractIInBounds 1000000 avgt 3 2.325 ? 27.865 ms/op
MathExact.C1_1.loopSubtractIOverflow 1000000 avgt 3 632.198 ? 280.799 ms/op
MathExact.C1_1.loopSubtractLInBounds 1000000 avgt 3 1.478 ? 0.281 ms/op
MathExact.C1_1.loopSubtractLOverflow 1000000 avgt 3 626.481 ? 47.028 ms/op
MathExact.C1_2.loopAddIInBounds 1000000 avgt 3 1.850 ? 0.462 ms/op
MathExact.C1_2.loopAddIOverflow 1000000 avgt 3 640.668 ? 217.610 ms/op
MathExact.C1_2.loopAddLInBounds 1000000 avgt 3 1.823 ? 0.123 ms/op
MathExact.C1_2.loopAddLOverflow 1000000 avgt 3 643.123 ? 174.505 ms/op
MathExact.C1_2.loopDecrementIInBounds 1000000 avgt 3 6.435 ? 54.316 ms/op
MathExact.C1_2.loopDecrementIOverflow 1000000 avgt 3 649.622 ? 15.314 ms/op
MathExact.C1_2.loopDecrementLInBounds 1000000 avgt 3 4.315 ? 26.421 ms/op
MathExact.C1_2.loopDecrementLOverflow 1000000 avgt 3 649.018 ? 386.320 ms/op
MathExact.C1_2.loopIncrementIInBounds 1000000 avgt 3 3.444 ? 1.375 ms/op
MathExact.C1_2.loopIncrementIOverflow 1000000 avgt 3 628.711 ? 51.292 ms/op
MathExact.C1_2.loopIncrementLInBounds 1000000 avgt 3 3.351 ? 0.483 ms/op
MathExact.C1_2.loopIncrementLOverflow 1000000 avgt 3 653.560 ? 160.718 ms/op
MathExact.C1_2.loopMultiplyIInBounds 1000000 avgt 3 1.860 ? 0.633 ms/op
MathExact.C1_2.loopMultiplyIOverflow 1000000 avgt 3 620.883 ? 54.516 ms/op
MathExact.C1_2.loopMultiplyLInBounds 1000000 avgt 3 3.998 ? 16.269 ms/op
MathExact.C1_2.loopMultiplyLOverflow 1000000 avgt 3 671.956 ? 93.092 ms/op
MathExact.C1_2.loopNegateIInBounds 1000000 avgt 3 4.415 ? 44.105 ms/op
MathExact.C1_2.loopNegateIOverflow 1000000 avgt 3 661.902 ? 224.843 ms/op
MathExact.C1_2.loopNegateLInBounds 1000000 avgt 3 3.492 ? 0.738 ms/op
MathExact.C1_2.loopNegateLOverflow 1000000 avgt 3 634.946 ? 150.491 ms/op
MathExact.C1_2.loopSubtractIInBounds 1000000 avgt 3 1.712 ? 0.066 ms/op
MathExact.C1_2.loopSubtractIOverflow 1000000 avgt 3 651.508 ? 76.022 ms/op
MathExact.C1_2.loopSubtractLInBounds 1000000 avgt 3 1.949 ? 0.201 ms/op
MathExact.C1_2.loopSubtractLOverflow 1000000 avgt 3 627.459 ? 26.817 ms/op
MathExact.C1_3.loopAddIInBounds 1000000 avgt 3 7.378 ? 4.301 ms/op
MathExact.C1_3.loopAddIOverflow 1000000 avgt 3 647.275 ? 177.062 ms/op
MathExact.C1_3.loopAddLInBounds 1000000 avgt 3 3.427 ? 0.037 ms/op
MathExact.C1_3.loopAddLOverflow 1000000 avgt 3 643.735 ? 227.934 ms/op
MathExact.C1_3.loopDecrementIInBounds 1000000 avgt 3 5.680 ? 0.497 ms/op
MathExact.C1_3.loopDecrementIOverflow 1000000 avgt 3 666.431 ? 8.006 ms/op
MathExact.C1_3.loopDecrementLInBounds 1000000 avgt 3 6.897 ? 24.615 ms/op
MathExact.C1_3.loopDecrementLOverflow 1000000 avgt 3 683.691 ? 52.892 ms/op
MathExact.C1_3.loopIncrementIInBounds 1000000 avgt 3 5.743 ? 0.602 ms/op
MathExact.C1_3.loopIncrementIOverflow 1000000 avgt 3 670.027 ? 175.208 ms/op
MathExact.C1_3.loopIncrementLInBounds 1000000 avgt 3 6.157 ? 2.876 ms/op
MathExact.C1_3.loopIncrementLOverflow 1000000 avgt 3 673.410 ? 245.939 ms/op
MathExact.C1_3.loopMultiplyIInBounds 1000000 avgt 3 3.220 ? 0.165 ms/op
MathExact.C1_3.loopMultiplyIOverflow 1000000 avgt 3 640.165 ? 505.006 ms/op
MathExact.C1_3.loopMultiplyLInBounds 1000000 avgt 3 7.986 ? 62.547 ms/op
MathExact.C1_3.loopMultiplyLOverflow 1000000 avgt 3 681.282 ? 107.856 ms/op
MathExact.C1_3.loopNegateIInBounds 1000000 avgt 3 7.133 ? 18.111 ms/op
MathExact.C1_3.loopNegateIOverflow 1000000 avgt 3 680.976 ? 285.486 ms/op
MathExact.C1_3.loopNegateLInBounds 1000000 avgt 3 7.405 ? 37.040 ms/op
MathExact.C1_3.loopNegateLOverflow 1000000 avgt 3 681.574 ? 173.484 ms/op
MathExact.C1_3.loopSubtractIInBounds 1000000 avgt 3 3.971 ? 16.942 ms/op
MathExact.C1_3.loopSubtractIOverflow 1000000 avgt 3 655.780 ? 230.793 ms/op
MathExact.C1_3.loopSubtractLInBounds 1000000 avgt 3 3.369 ? 3.844 ms/op
MathExact.C1_3.loopSubtractLOverflow 1000000 avgt 3 634.824 ? 20.350 ms/op
MathExact.C2.loopAddIInBounds 1000000 avgt 3 2.461 ? 2.936 ms/op
MathExact.C2.loopAddIOverflow 1000000 avgt 3 589.095 ? 151.126 ms/op
MathExact.C2.loopAddLInBounds 1000000 avgt 3 0.978 ? 0.604 ms/op
MathExact.C2.loopAddLOverflow 1000000 avgt 3 590.511 ? 64.618 ms/op
MathExact.C2.loopDecrementIInBounds 1000000 avgt 3 1.981 ? 0.443 ms/op
MathExact.C2.loopDecrementIOverflow 1000000 avgt 3 593.578 ? 32.752 ms/op
MathExact.C2.loopDecrementLInBounds 1000000 avgt 3 2.924 ? 29.455 ms/op
MathExact.C2.loopDecrementLOverflow 1000000 avgt 3 601.392 ? 936.568 ms/op
MathExact.C2.loopIncrementIInBounds 1000000 avgt 3 2.697 ? 22.142 ms/op
MathExact.C2.loopIncrementIOverflow 1000000 avgt 3 602.418 ? 199.763 ms/op
MathExact.C2.loopIncrementLInBounds 1000000 avgt 3 1.954 ? 0.396 ms/op
MathExact.C2.loopIncrementLOverflow 1000000 avgt 3 601.183 ? 156.439 ms/op
MathExact.C2.loopMultiplyIInBounds 1000000 avgt 3 1.530 ? 7.954 ms/op
MathExact.C2.loopMultiplyIOverflow 1000000 avgt 3 566.677 ? 45.992 ms/op
MathExact.C2.loopMultiplyLInBounds 1000000 avgt 3 2.184 ? 22.242 ms/op
MathExact.C2.loopMultiplyLOverflow 1000000 avgt 3 600.233 ? 234.648 ms/op
MathExact.C2.loopNegateIInBounds 1000000 avgt 3 2.130 ? 1.028 ms/op
MathExact.C2.loopNegateIOverflow 1000000 avgt 3 593.145 ? 337.886 ms/op
MathExact.C2.loopNegateLInBounds 1000000 avgt 3 2.600 ? 20.795 ms/op
MathExact.C2.loopNegateLOverflow 1000000 avgt 3 592.288 ? 138.321 ms/op
MathExact.C2.loopSubtractIInBounds 1000000 avgt 3 1.081 ? 0.265 ms/op
MathExact.C2.loopSubtractIOverflow 1000000 avgt 3 575.884 ? 200.113 ms/op
MathExact.C2.loopSubtractLInBounds 1000000 avgt 3 1.016 ? 0.792 ms/op
MathExact.C2.loopSubtractLOverflow 1000000 avgt 3 589.873 ? 52.521 ms/op
MathExact.C2_no_intrinsics.loopAddIInBounds 1000000 avgt 3 2.166 ? 10.999 ms/op
MathExact.C2_no_intrinsics.loopAddIOverflow 1000000 avgt 3 586.660 ? 229.451 ms/op
MathExact.C2_no_intrinsics.loopAddLInBounds 1000000 avgt 3 1.054 ? 0.528 ms/op
MathExact.C2_no_intrinsics.loopAddLOverflow 1000000 avgt 3 572.511 ? 76.440 ms/op
MathExact.C2_no_intrinsics.loopDecrementIInBounds 1000000 avgt 3 1.907 ? 0.149 ms/op
MathExact.C2_no_intrinsics.loopDecrementIOverflow 1000000 avgt 3 599.262 ? 600.992 ms/op
MathExact.C2_no_intrinsics.loopDecrementLInBounds 1000000 avgt 3 1.820 ? 0.106 ms/op
MathExact.C2_no_intrinsics.loopDecrementLOverflow 1000000 avgt 3 570.464 ? 44.418 ms/op
MathExact.C2_no_intrinsics.loopIncrementIInBounds 1000000 avgt 3 1.914 ? 0.131 ms/op
MathExact.C2_no_intrinsics.loopIncrementIOverflow 1000000 avgt 3 575.143 ? 160.185 ms/op
MathExact.C2_no_intrinsics.loopIncrementLInBounds 1000000 avgt 3 1.818 ? 0.288 ms/op
MathExact.C2_no_intrinsics.loopIncrementLOverflow 1000000 avgt 3 589.998 ? 33.029 ms/op
MathExact.C2_no_intrinsics.loopMultiplyIInBounds 1000000 avgt 3 1.960 ? 10.135 ms/op
MathExact.C2_no_intrinsics.loopMultiplyIOverflow 1000000 avgt 3 571.497 ? 264.484 ms/op
MathExact.C2_no_intrinsics.loopMultiplyLInBounds 1000000 avgt 3 1.061 ? 0.198 ms/op
MathExact.C2_no_intrinsics.loopMultiplyLOverflow 1000000 avgt 3 585.139 ? 317.175 ms/op
MathExact.C2_no_intrinsics.loopNegateIInBounds 1000000 avgt 3 2.611 ? 22.325 ms/op
MathExact.C2_no_intrinsics.loopNegateIOverflow 1000000 avgt 3 579.911 ? 140.426 ms/op
MathExact.C2_no_intrinsics.loopNegateLInBounds 1000000 avgt 3 2.233 ? 2.774 ms/op
MathExact.C2_no_intrinsics.loopNegateLOverflow 1000000 avgt 3 572.368 ? 81.851 ms/op
MathExact.C2_no_intrinsics.loopSubtractIInBounds 1000000 avgt 3 3.162 ? 38.115 ms/op
MathExact.C2_no_intrinsics.loopSubtractIOverflow 1000000 avgt 3 582.794 ? 65.622 ms/op
MathExact.C2_no_intrinsics.loopSubtractLInBounds 1000000 avgt 3 1.028 ? 0.255 ms/op
MathExact.C2_no_intrinsics.loopSubtractLOverflow 1000000 avgt 3 577.491 ? 69.778 ms/op

Is it worth having intrinsics at all? @eme64 wondered, so I tried with this code:

    public class Test {
        final static int N = 500_000_000;

        public static int test(int i) {
            try {
                return Math.multiplyExact(i, i);
            } catch (Throwable e) {
                return 0;
            }
        }

        public static void loop() {
            for (int i = 0; i < N; i++) {
                test(i % 32_768);
            }
        }

        public static void main(String[] args) {
            loop();
        }
    }

and with many more runs (50 instead of 3), under a more stable load from the rest of the system.

No intrinsic (inlined Java implementation):

    Benchmark 1: ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement Test.java
      Time (mean ? ?): 8.651 s ? 0.902 s [User: 8.517 s, System: 0.155 s]
      Range (min ? max): 6.853 s ? 10.439 s    50 runs

Always intrinsic (current behavior, and the new behavior in the absence of overflow, as in this example):

    Benchmark 1: ~/jdk/build/linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,"Test*::test*" -XX:-UseOnStackReplacement Test.java
      Time (mean ? ?): 8.222 s ? 1.024 s [User: 8.090 s, System: 0.155 s]
      Range (min ? max): 6.667 s ? 10.406 s    50 runs

So it's... not very conclusive, but likely to be a bit useful. The gap between the means is about 0.4s, which is less than half the standard deviation. Still, it seems good to have.

From a more theoretical point of view, we can see that the code generated for the intrinsic is mostly a `mul` and a `jo`, while it is much more complicated for the inlined Java implementation (with many `mov`, `movsx`, `cmp` and conditional jumps, looking a lot like the Java code).

Thanks,
Marc

------------- Commit messages: - More exhaustive bench - Limit inlining of math Exact operations in case of too many deopts Changes: https://git.openjdk.org/jdk/pull/23916/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346989 Stats: 405 lines in 2 files changed: 404 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23916.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916 PR: https://git.openjdk.org/jdk/pull/23916

From epeter at openjdk.org Fri Mar 7 14:22:30 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Mar 2025 14:22:30 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: 

On Wed, 5 Mar 2025 12:56:48 GMT, Marc Chevalier wrote:

> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments.
> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached.
>
> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all.
>
> tl;dr:
> - C1: no problem, no change
> - C2:
> - with intrinsics:
> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms)
> - without overflow: no problem, no change
> - without intrinsics: no problem, no change
>
> Before the fix:
>
> Benchmark (SIZE) Mode Cnt Score Error Units
> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op
> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op
> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ?
0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... The benchmark generally looks good to me, I only have some minor suggestions ;) Ah. And is this only about `multiplyExact`, or are there other methods affected? Would be nice to extend the benchmark to those as well. And yet another idea: you could probably write an IR test that checks that we at first have the compilation with the trap, and another test where we trap too much and then get a different compilation (without the intrinsic?). Plus: the issue title is very generic. I think it should mention something about `Math.*Exact` as well ;) test/micro/org/openjdk/bench/vm/compiler/MultiplyExact.java line 47: > 45: try { > 46: return square(i); > 47: } catch (Throwable e) { Can you catch a more specific exception? Catching very general exceptions can often mask other bugs. I suppose this is only a benchmark, but it would still be good practice ;) test/micro/org/openjdk/bench/vm/compiler/MultiplyExact.java line 62: > 60: > 61: @Fork(value = 1) > 62: public static class C2 extends MultiplyExact {} What about a C2 version where you just disable the intrinsic? ------------- PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2663529726 PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2703023122 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1982809388 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1982808076 From epeter at openjdk.org Fri Mar 7 14:22:30 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Mar 2025 14:22:30 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Thu, 6 Mar 2025 07:16:40 GMT, Emanuel Peter wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. 
Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > The benchmark generally looks good to me, I only have some minor suggestions ;) > Is it worth inlining at all? @eme64 wondered, so I tried with this code: You ask this in the PR description. I think I was not thinking about `inlining` but rather using the `intrinsic`. How much speedup does the intrinsic really deliver? Is it really better than pure Java? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2703015476 From duke at openjdk.org Fri Mar 7 14:22:30 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 7 Mar 2025 14:22:30 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Thu, 6 Mar 2025 07:19:48 GMT, Emanuel Peter wrote: > You ask this in the PR description. I think I was not thinking about inlining but rather using the intrinsic. How much speedup does the intrinsic really deliver? Is it really better than pure Java? My fault. I used "inline" instead of "intrinsic" because the functions implementing the intrinsic are called `inline_math_mathExact` and alike. So, I compared the intrinsic vs. the pure java implementation, that happens to be inlined. And intrinsic is a bit better. I'll edit the text to fix that. 
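(For readers following along: the pure-Java fallback that gets inlined when the intrinsic is not used looks roughly like the sketch below, paraphrased from java.lang.Math for the int overload only; treat it as illustrative rather than the exact JDK source. The long overload is more involved, which is part of why its non-intrinsic code looks so much busier.)

    public static int multiplyExact(int x, int y) {
        long r = (long) x * (long) y;   // widen to 64 bits so the product cannot wrap
        if ((int) r != r) {
            // The intrinsic handles this case with an uncommon trap instead of an explicit throw.
            throw new ArithmeticException("integer overflow");
        }
        return (int) r;
    }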
------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2703823132 From duke at openjdk.org Fri Mar 7 14:22:30 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 7 Mar 2025 14:22:30 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: <7npMvWN2HNTIZOpeIVuhrZM9i5YiZEDvJC6xlReut_4=.e8a98a0b-7146-44a7-94e1-0d4a27566f1f@github.com> On Thu, 6 Mar 2025 07:11:40 GMT, Emanuel Peter wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > test/micro/org/openjdk/bench/vm/compiler/MultiplyExact.java line 47: > >> 45: try { >> 46: return square(i); >> 47: } catch (Throwable e) { > > Can you catch a more specific exception? Catching very general exceptions can often mask other bugs. I suppose this is only a benchmark, but it would still be good practice ;) Indeed. > test/micro/org/openjdk/bench/vm/compiler/MultiplyExact.java line 62: > >> 60: >> 61: @Fork(value = 1) >> 62: public static class C2 extends MultiplyExact {} > > What about a C2 version where you just disable the intrinsic? Good idea. Done. 
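(One plausible way to spell such a fork, assuming the `MathExact` base class used by this benchmark and the `UseMathExactIntrinsics` flag; the actual patch may configure it differently.)

    // Same C2 setup as the plain fork, but with the Math.*Exact intrinsics turned off.
    @Fork(value = 1, jvmArgsAppend = {"-XX:-UseMathExactIntrinsics"})
    public static class C2_no_intrinsics extends MathExact {}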
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1985004497 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1985003664 From vlivanov at openjdk.org Fri Mar 7 18:06:56 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 7 Mar 2025 18:06:56 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> On Wed, 5 Mar 2025 12:56:48 GMT, Marc Chevalier wrote: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... Nice benchmark, Marc! src/hotspot/share/opto/library_call.cpp line 1963: > 1961: set_i_o(i_o()); > 1962: > 1963: uncommon_trap(Deoptimization::Reason_intrinsic, What about using `builtin_throw` here? (Requires some tuning on `builtin_throw` side.) How much does it affect performance? Also, passing `must_throw = true` into `uncommon_trap` may help a bit here as well. 
------------- PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2667969834 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1985476888 From tschatzl at openjdk.org Sat Mar 8 19:32:54 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Sat, 8 Mar 2025 19:32:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v15] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. Cause are last-minute changes before making the PR ready to review. 
Testing: without the patch, occurs fairly frequently when continuously (1 in 20) starting refinement. Does not afterward. - * ayang review 3 * comments * minor refactorings ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/350a4fa3..93b884f1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=13-14 Stats: 35 lines in 5 files changed: 30 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Sat Mar 8 19:32:54 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Sat, 8 Mar 2025 19:32:54 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v9] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 10:46:13 GMT, Thomas Schatzl wrote: > I got an error while testing java/foreign/TestUpcallStress.java on linuxaarch64 with this PR: Fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2708458459 From dnsimon at openjdk.org Sun Mar 9 19:12:34 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Sun, 9 Mar 2025 19:12:34 GMT Subject: RFR: 8346825: [JVMCI] Remove NativeImageReinitialize annotationremoved NativeImageReinitialize annotation Message-ID: The `jdk.vm.ci.common.NativeImageReinitialize` annotation was introduced to reset JVMCI and Graal fields to their default values as they are copied into the libgraal image. Now that class loader separation is used to isolate the JVMCI and Graal classes compiled to produce libgraal from the JVMCI and Graal classes being executed to do the AOT compilation, the need for this field resetting is no longer needed. This PR removes the `NativeImageReinitialize` annotation. ------------- Commit messages: - removed NativeImageReinitialize annotation Changes: https://git.openjdk.org/jdk/pull/23957/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23957&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346825 Stats: 69 lines in 10 files changed: 0 ins; 44 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/23957.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23957/head:pull/23957 PR: https://git.openjdk.org/jdk/pull/23957 From never at openjdk.org Mon Mar 10 02:46:03 2025 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 10 Mar 2025 02:46:03 GMT Subject: RFR: 8346825: [JVMCI] Remove NativeImageReinitialize annotationremoved NativeImageReinitialize annotation In-Reply-To: References: Message-ID: On Sun, 9 Mar 2025 19:07:54 GMT, Doug Simon wrote: > The `jdk.vm.ci.common.NativeImageReinitialize` annotation was introduced to reset JVMCI and Graal fields to their default values as they are copied into the libgraal image. Now that class loader separation is used to isolate the JVMCI and Graal classes compiled to produce libgraal from the JVMCI and Graal classes being executed to do the AOT compilation, the need for this field resetting is no longer needed. This PR removes the `NativeImageReinitialize` annotation. Marked as reviewed by never (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23957#pullrequestreview-2669672002 From lmesnik at openjdk.org Mon Mar 10 03:03:00 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Mon, 10 Mar 2025 03:03:00 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Thu, 6 Mar 2025 17:37:33 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Accepted review comments. There are no any new tests in the PR. How fix has been tested by openjdk tests? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2709309387 From roland at openjdk.org Mon Mar 10 09:02:15 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 10 Mar 2025 09:02:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:19:03 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. 
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 47 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Add assertion comments > - Add simple reduction benchmarks on top of multiply ones > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - ... and 37 more: https://git.openjdk.org/jdk/compare/07ef652d...1aa690d3 Marked as reviewed by roland (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/20098#pullrequestreview-2670211951 From chagedorn at openjdk.org Mon Mar 10 09:19:10 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 10 Mar 2025 09:19:10 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:19:03 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. 
>> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 47 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Add assertion comments > - Add simple reduction benchmarks on top of multiply ones > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - ... and 37 more: https://git.openjdk.org/jdk/compare/fd78e706...1aa690d3 Good work and collection of all the data! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20098#pullrequestreview-2670256931 From duke at openjdk.org Mon Mar 10 10:23:01 2025 From: duke at openjdk.org (Marc Chevalier) Date: Mon, 10 Mar 2025 10:23:01 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> Message-ID: On Fri, 7 Mar 2025 18:03:14 GMT, Vladimir Ivanov wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. 
>> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > src/hotspot/share/opto/library_call.cpp line 1963: > >> 1961: set_i_o(i_o()); >> 1962: >> 1963: uncommon_trap(Deoptimization::Reason_intrinsic, > > What about using `builtin_throw` here? (Requires some tuning on `builtin_throw` side.) How much does it affect performance? Also, passing `must_throw = true` into `uncommon_trap` may help a bit here as well. Using `builtin_throw` sounds nice! But indeed, it won't work so directly. I want to prevent intrinsic in case of `too_many_traps`. But that's only when `builtin_throw` will do something. But if I only rely on `builtin_throw`, then, when the built-in throwing is not possible (that is when `treat_throw_as_hot && method()->can_omit_stack_trace()` is false), we will have the repeated deopt again. There is also throwing the right exception, which is right now determined only by the reason (which adapts poorly to this case). I guess that's what you meant by tuning: be able to know if we would built-in throw, and if so, do it, otherwise, prevent infinitely repeated deopt. The way I see doing that is by (maybe optionally) providing the preallocated exception to throw as a parameter so that we don't have to rely on the "reason to exception" decision (or we can override it), and factor out the decision whether we can take the nice branch of `builtin_throw` so that we can bail out of intrinsic if we can't fast throw before we start setting up the intrinsic (that we would then need to undo). Does that match what you had in mind or you have another suggestion? 
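(For context, the current fix amounts to roughly the following sketch against LibraryCallKit; the function name is hypothetical and this is not the actual patch.)

    // Sketch only: bail out of the Math.*Exact intrinsic when this method already
    // trapped here repeatedly, so C2 compiles the Java fallback instead of
    // re-entering the deopt/recompile cycle.
    bool LibraryCallKit::inline_math_overflow_sketch(Node* arg1, Node* arg2) {
      if (too_many_traps(Deoptimization::Reason_intrinsic)) {
        return false;  // do not use the intrinsic; the bytecode is compiled normally
      }
      // ... build the overflow-checked node as before, ending in
      // uncommon_trap(Deoptimization::Reason_intrinsic, ...) on overflow.
      return true;
    }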
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r1986999005 From dnsimon at openjdk.org Mon Mar 10 11:06:00 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 10 Mar 2025 11:06:00 GMT Subject: RFR: 8346825: [JVMCI] Remove NativeImageReinitialize annotation In-Reply-To: References: Message-ID: On Sun, 9 Mar 2025 19:07:54 GMT, Doug Simon wrote: > The `jdk.vm.ci.common.NativeImageReinitialize` annotation was introduced to reset JVMCI and Graal fields to their default values as they are copied into the libgraal image. Now that class loader separation is used to isolate the JVMCI and Graal classes compiled to produce libgraal from the JVMCI and Graal classes being executed to do the AOT compilation, the need for this field resetting is no longer needed. This PR removes the `NativeImageReinitialize` annotation. Thanks for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23957#issuecomment-2710203461 From dnsimon at openjdk.org Mon Mar 10 11:06:00 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 10 Mar 2025 11:06:00 GMT Subject: Integrated: 8346825: [JVMCI] Remove NativeImageReinitialize annotation In-Reply-To: References: Message-ID: On Sun, 9 Mar 2025 19:07:54 GMT, Doug Simon wrote: > The `jdk.vm.ci.common.NativeImageReinitialize` annotation was introduced to reset JVMCI and Graal fields to their default values as they are copied into the libgraal image. Now that class loader separation is used to isolate the JVMCI and Graal classes compiled to produce libgraal from the JVMCI and Graal classes being executed to do the AOT compilation, the need for this field resetting is no longer needed. This PR removes the `NativeImageReinitialize` annotation. This pull request has now been integrated. Changeset: 99547c5b Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/99547c5b254807580e0a5238b95d55d38181f4fc Stats: 69 lines in 10 files changed: 0 ins; 44 del; 25 mod 8346825: [JVMCI] Remove NativeImageReinitialize annotation Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/23957 From fyang at openjdk.org Tue Mar 11 03:25:55 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 11 Mar 2025 03:25:55 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v15] In-Reply-To: References: Message-ID: On Sat, 8 Mar 2025 19:32:54 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: > > - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. > Cause are last-minute changes before making the PR ready to review. > > Testing: without the patch, occurs fairly frequently when continuously > (1 in 20) starting refinement. Does not afterward. > - * ayang review 3 > * comments > * minor refactorings Tier1-3 test good on linux-riscv64 platform. And I have prepared an add-on change which implements the barrier method to write cards for a reference array for this platform. Do you want to have it in this PR? Thanks. [23739-riscv-addon.txt](https://github.com/user-attachments/files/19174898/23739-riscv-addon.txt) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2712469306 From tschatzl at openjdk.org Tue Mar 11 09:51:53 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 11 Mar 2025 09:51:53 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v16] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
> > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/93b884f1..758fac01 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=15 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=14-15 Stats: 36 lines in 1 file changed: 28 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Tue Mar 11 09:54:05 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 11 Mar 2025 09:54:05 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v15] In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 03:22:52 GMT, Fei Yang wrote: > Tier1-3 test good on linux-riscv64 platform. And I have prepared an add-on change which implements the barrier method to write cards for a reference array for this platform. Do you want to have it in this PR? Thanks. I added your changes, thank you! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2713415911

From shade at openjdk.org Tue Mar 11 11:41:28 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 11 Mar 2025 11:41:28 GMT Subject: RFR: 8351640: Print reason for making method not entrant Message-ID: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com>

A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out.

Sample log excerpt for mainline:

$ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log
987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes)
1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes)
1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used
4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap
5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes)
6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes)
6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used

You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused.

Additional testing:
 - [x] Linux x86_64 server fastdebug, `hotspot:tier1`

------------- Commit messages: - Use resource allocation for temp buffer - Base version Changes: https://git.openjdk.org/jdk/pull/23980/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351640 Stats: 36 lines in 14 files changed: 8 ins; 0 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/23980.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23980/head:pull/23980 PR: https://git.openjdk.org/jdk/pull/23980

From kvn at openjdk.org Tue Mar 11 17:58:54 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Mar 2025 17:58:54 GMT Subject: RFR: 8351640: Print reason for making method not entrant In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: 

On Tue, 11 Mar 2025 11:36:59 GMT, Aleksey Shipilev wrote:

> A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out.
> > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23980#pullrequestreview-2675594015 From vlivanov at openjdk.org Tue Mar 11 18:52:56 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 11 Mar 2025 18:52:56 GMT Subject: RFR: 8351640: Print reason for making method not entrant In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Tue, 11 Mar 2025 11:36:59 GMT, Aleksey Shipilev wrote: > A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. > > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. 
> > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` src/hotspot/share/code/nmethod.cpp line 1965: > 1963: if (LogCompilation) { > 1964: if (xtty != nullptr) { > 1965: ttyLocker ttyl; // keep the following output all in one block Please, include same info in `LogCompilation` log. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23980#discussion_r1989937760 From dnsimon at openjdk.org Tue Mar 11 19:41:18 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 11 Mar 2025 19:41:18 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null Message-ID: All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. ------------- Commit messages: - nmethod entry barriers are no longer optional Changes: https://git.openjdk.org/jdk/pull/23996/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351700 Stats: 171 lines in 27 files changed: 5 ins; 103 del; 63 mod Patch: https://git.openjdk.org/jdk/pull/23996.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23996/head:pull/23996 PR: https://git.openjdk.org/jdk/pull/23996 From eosterlund at openjdk.org Tue Mar 11 19:41:18 2025 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Tue, 11 Mar 2025 19:41:18 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 19:29:05 GMT, Doug Simon wrote: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. Nice! Looks good. ------------- Marked as reviewed by eosterlund (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23996#pullrequestreview-2675894137 From never at openjdk.org Tue Mar 11 19:53:00 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 11 Mar 2025 19:53:00 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 19:29:05 GMT, Doug Simon wrote: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 6549: > 6547: BarrierSetNMethod* bs_nm = BarrierSet::barrier_set()->barrier_set_nmethod(); > 6548: if (bs_nm != nullptr) { > 6549: StubRoutines::_method_entry_barrier = generate_method_entry_barrier(); Shouldn't you have kept this line? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23996#discussion_r1990025685 From dnsimon at openjdk.org Tue Mar 11 20:01:00 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 11 Mar 2025 20:01:00 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v2] In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 19:50:18 GMT, Tom Rodriguez wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> revived accidentally deleted code > > src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 6549: > >> 6547: BarrierSetNMethod* bs_nm = BarrierSet::barrier_set()->barrier_set_nmethod(); >> 6548: if (bs_nm != nullptr) { >> 6549: StubRoutines::_method_entry_barrier = generate_method_entry_barrier(); > > Shouldn't you have kept this line? Absolutely! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23996#discussion_r1990039724 From dnsimon at openjdk.org Tue Mar 11 20:00:59 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 11 Mar 2025 20:00:59 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v2] In-Reply-To: References: Message-ID: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. Doug Simon has updated the pull request incrementally with one additional commit since the last revision: revived accidentally deleted code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23996/files - new: https://git.openjdk.org/jdk/pull/23996/files/b958ee43..b3d4721d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23996.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23996/head:pull/23996 PR: https://git.openjdk.org/jdk/pull/23996 From never at openjdk.org Tue Mar 11 21:53:55 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 11 Mar 2025 21:53:55 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v2] In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 20:00:59 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > revived accidentally deleted code Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23996#pullrequestreview-2676195527 From fyang at openjdk.org Wed Mar 12 00:32:53 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 12 Mar 2025 00:32:53 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v2] In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 20:00:59 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > revived accidentally deleted code src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 9903: > 9901: generate_arraycopy_stubs(); > 9902: > 9903: BarrierSetNMethod* bs_nm = BarrierSet::barrier_set()->barrier_set_nmethod(); Drive-by comment: `bs_nm` seems not used any more. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23996#discussion_r1990347462 From shade at openjdk.org Wed Mar 12 07:35:33 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 07:35:33 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v2] In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: <370pnPWKXnqHXz9pVOoU9vFfqdH8zIIV2K7BpqWRcEI=.0c63f38f-84ab-49c3-a0da-1ad9f1b22fb1@github.com> > A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. 
> > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Add to LogCompilation as well - Merge branch 'master' into JDK-8351640-nmethod-not-entrant-reason - Use resource allocation for temp buffer - Base version ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23980/files - new: https://git.openjdk.org/jdk/pull/23980/files/b13a1080..38491fb2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=00-01 Stats: 38661 lines in 408 files changed: 18309 ins; 13442 del; 6910 mod Patch: https://git.openjdk.org/jdk/pull/23980.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23980/head:pull/23980 PR: https://git.openjdk.org/jdk/pull/23980 From shade at openjdk.org Wed Mar 12 07:35:33 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 07:35:33 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v2] In-Reply-To: References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Tue, 11 Mar 2025 18:45:40 GMT, Vladimir Ivanov wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Add to LogCompilation as well >> - Merge branch 'master' into JDK-8351640-nmethod-not-entrant-reason >> - Use resource allocation for temp buffer >> - Base version > > src/hotspot/share/code/nmethod.cpp line 1965: > >> 1963: if (LogCompilation) { >> 1964: if (xtty != nullptr) { >> 1965: ttyLocker ttyl; // keep the following output all in one block > > Please, include same info in `LogCompilation` log. Sure, added. 
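As a stand-alone illustration of the lifecycle described in the quoted log excerpt above: the following little program (purely hypothetical, not part of this PR; class and method names are made up) will typically go through the same transitions when run with -XX:+PrintCompilation. Whether the uncommon trap actually fires depends on the JIT's profiling heuristics, so treat it as a sketch rather than a guaranteed reproducer.

    public class NotEntrantDemo {
        // Hot method with a branch that stays cold during warm-up. C2 may prune
        // the cold path based on profile data and later deoptimize ("uncommon
        // trap") when that path is finally taken, after which the method is
        // recompiled and the old compilations are made not entrant.
        static int lookup(int[] table, int key) {
            for (int i = 0; i < table.length; i++) {
                if (table[i] == key) {
                    return i;
                }
            }
            return -1; // never reached during warm-up
        }

        public static void main(String[] args) {
            int[] table = new int[1024];
            for (int i = 0; i < table.length; i++) {
                table[i] = i;
            }
            long sum = 0;
            // Warm up: every key is found, so the "return -1" path stays cold.
            for (int i = 0; i < 200_000; i++) {
                sum += lookup(table, i & 1023);
            }
            // Now query keys that are never found, forcing the cold path.
            for (int i = 0; i < 200_000; i++) {
                sum += lookup(table, -42);
            }
            System.out.println(sum);
        }
    }

Running it as java -XX:+PrintCompilation NotEntrantDemo | grep lookup should show the level 3/4 compilations and, with this change, reason strings such as "made not entrant: not used" or "made not entrant: uncommon trap".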
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23980#discussion_r1990826189 From dnsimon at openjdk.org Wed Mar 12 09:16:44 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 12 Mar 2025 09:16:44 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: References: Message-ID: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. Doug Simon has updated the pull request incrementally with one additional commit since the last revision: removed unused code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23996/files - new: https://git.openjdk.org/jdk/pull/23996/files/b3d4721d..95da3c2f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23996&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23996.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23996/head:pull/23996 PR: https://git.openjdk.org/jdk/pull/23996 From shade at openjdk.org Wed Mar 12 09:46:01 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 09:46:01 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: References: Message-ID: <1stcVqx5LbF9cnNm4gb4YXqoHBbBBigH5fpYlBqRttI=.79261377-2b11-49eb-802d-b579fd23a9ff@github.com> On Wed, 12 Mar 2025 09:16:44 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > removed unused code Looks fine, thanks. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23996#pullrequestreview-2677747978 From tschatzl at openjdk.org Wed Mar 12 11:58:45 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 12 Mar 2025 11:58:45 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision: - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. Cause are last-minute changes before making the PR ready to review. Testing: without the patch, occurs fairly frequently when continuously (1 in 20) starting refinement. Does not afterward. - * ayang review 3 * comments * minor refactorings - * iwalulya review * renaming * fix some includes, forward declaration - * fix whitespace * additional whitespace between log tags * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename - ayang review * renamings * refactorings - iwalulya review * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement * predicate for determining whether the refinement has been disabled * some other typos/comment improvements * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming - * ayang review - fix comment - * iwalulya review 2 * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState * some additional documentation - ... 
and 14 more: https://git.openjdk.org/jdk/compare/f77fa17b...aec95051 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/758fac01..aec95051 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=15-16 Stats: 78123 lines in 1539 files changed: 36243 ins; 29177 del; 12703 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From dnsimon at openjdk.org Wed Mar 12 12:21:57 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 12 Mar 2025 12:21:57 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 09:16:44 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > removed unused code `gc/TestAllocHumongousFragment.java#generational` is failing on Windows: https://github.com/dougxc/jdk/actions/runs/13807682996/job/38625487569#step:9:630 I don't think it can be caused by this PR. Are you able to confirm that @shipilev ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23996#issuecomment-2717699848 From shade at openjdk.org Wed Mar 12 12:34:03 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 12:34:03 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: <1stcVqx5LbF9cnNm4gb4YXqoHBbBBigH5fpYlBqRttI=.79261377-2b11-49eb-802d-b579fd23a9ff@github.com> References: <1stcVqx5LbF9cnNm4gb4YXqoHBbBBigH5fpYlBqRttI=.79261377-2b11-49eb-802d-b579fd23a9ff@github.com> Message-ID: On Wed, 12 Mar 2025 09:43:21 GMT, Aleksey Shipilev wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> removed unused code > > Looks fine, thanks. > `gc/TestAllocHumongousFragment.java#generational` is failing on Windows: https://github.com/dougxc/jdk/actions/runs/13807682996/job/38625487569#step:9:630 I don't think it can be caused by this PR. Are you able to confirm that @shipilev ? It was problemlisted by #23982 yesterday. You can ignore it, or merge with recent master to get clean GHA runs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23996#issuecomment-2717727270 From dnsimon at openjdk.org Wed Mar 12 12:34:04 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 12 Mar 2025 12:34:04 GMT Subject: RFR: 8351700: Remove code conditional on BarrierSetNMethod being null [v3] In-Reply-To: References: Message-ID: <-urz_l6_Sa21e9SspzfanN4VGdOFZJxOv6E79Npfv5A=.baeb6814-351b-4711-b7fe-4d87e0700532@github.com> On Wed, 12 Mar 2025 09:16:44 GMT, Doug Simon wrote: >> All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > removed unused code I'll ignore it. Thanks for pointing out the problem listing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23996#issuecomment-2717730379 From dnsimon at openjdk.org Wed Mar 12 12:34:05 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 12 Mar 2025 12:34:05 GMT Subject: Integrated: 8351700: Remove code conditional on BarrierSetNMethod being null In-Reply-To: References: Message-ID: On Tue, 11 Mar 2025 19:29:05 GMT, Doug Simon wrote: > All GCs started needing nmethod entry barriers as of loom so there's no longer any need to test for null nmethod entry barriers. This pull request has now been integrated. Changeset: 95b66d5a Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/95b66d5a43a77b257a097afe5df369f92769abd2 Stats: 171 lines in 27 files changed: 5 ins; 102 del; 64 mod 8351700: Remove code conditional on BarrierSetNMethod being null Reviewed-by: shade, eosterlund, never ------------- PR: https://git.openjdk.org/jdk/pull/23996 From ayang at openjdk.org Wed Mar 12 13:33:59 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Wed, 12 Mar 2025 13:33:59 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v14] In-Reply-To: <5w6qUwzDQadxseocRl6rRF0AllyeukWTpYl2XjAfiTE=.fb62a50e-e308-4d08-8057-67e70e13ccbb@github.com> References: <5w6qUwzDQadxseocRl6rRF0AllyeukWTpYl2XjAfiTE=.fb62a50e-e308-4d08-8057-67e70e13ccbb@github.com> Message-ID: On Fri, 7 Mar 2025 13:14:02 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * iwalulya review >> * renaming >> * fix some includes, forward declaration > > src/hotspot/share/gc/g1/g1CardTable.hpp line 76: > >> 74: g1_card_already_scanned = 0x1, >> 75: g1_to_cset_card = 0x2, >> 76: g1_from_remset_card = 0x4 > > Could you outline the motivation for this more precise info? Is it for optimization or essentially for correctness? OK, it's for better performance, not correctness. How much is the improvement? As I understand it, this more precise info is largely independent of the new barrier logic. I wonder if it makes sense to extract this out to its own ticket to better assess its impact on perf and impl complexity. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991375754 From ayang at openjdk.org Wed Mar 12 13:34:04 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Wed, 12 Mar 2025 13:34:04 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: References: Message-ID: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> On Wed, 12 Mar 2025 11:58:45 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
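To get a rough feel for the barrier cost the JEP description above refers to, a reference-store loop along the following lines (purely illustrative; it is not one of the benchmarks linked from the JEP and is far less rigorous than a JMH run) can be timed once with -XX:+UseG1GC and once with -XX:+UseParallelGC. Much of the difference typically comes from the different post-write barriers emitted for the array stores, although allocation and collection behaviour also differ between the collectors, so the numbers are only indicative.

    public class StoreBarrierDemo {
        static Object[] targets = new Object[1 << 20];
        static Object payload = new Object();

        // Each iteration performs a reference store, which is exactly the kind
        // of write the G1 post-write barrier instruments.
        static void fill(Object[] a, Object v) {
            for (int i = 0; i < a.length; i++) {
                a[i] = v;
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < 100; i++) {
                fill(targets, payload);       // warm-up so the loop is C2-compiled
            }
            long start = System.nanoTime();
            for (int i = 0; i < 1_000; i++) {
                fill(targets, payload);
            }
            System.out.printf("%.1f ms%n", (System.nanoTime() - start) / 1e6);
        }
    }

For example: java -XX:+UseG1GC StoreBarrierDemo versus java -XX:+UseParallelGC StoreBarrierDemo.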
>> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision: > > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang > - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. > Cause are last-minute changes before making the PR ready to review. > > Testing: without the patch, occurs fairly frequently when continuously > (1 in 20) starting refinement. Does not afterward. 
> - * ayang review 3 > * comments > * minor refactorings > - * iwalulya review > * renaming > * fix some includes, forward declaration > - * fix whitespace > * additional whitespace between log tags > * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename > - ayang review > * renamings > * refactorings > - iwalulya review > * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement > * predicate for determining whether the refinement has been disabled > * some other typos/comment improvements > * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming > - * ayang review - fix comment > - * iwalulya review 2 > * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState > * some additional documentation > - ... and 14 more: https://git.openjdk.org/jdk/compare/53a66058...aec95051 src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 217: > 215: > 216: { > 217: SuspendibleThreadSetLeaver sts_leave; Can you add some comment on why leaving the set is required? It's not obvious to me why. I'd expect handshake to work out of the box... src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 263: > 261: > 262: SuspendibleThreadSetLeaver sts_leave; > 263: VMThread::execute(&op); Can you elaborate what synchronization this VM op is trying to achieve? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991489399 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991382024 From duke at openjdk.org Wed Mar 12 13:42:33 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 13:42:33 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Added validity test for the intrinsics. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/64135f29..f65ef7c4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=04-05 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Wed Mar 12 13:51:58 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 13:51:58 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Mon, 10 Mar 2025 03:00:09 GMT, Leonid Mesnik wrote: > There are no any new tests in the PR. How fix has been tested by openjdk tests? I have just added one. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2717950685 From duke at openjdk.org Wed Mar 12 13:52:02 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 13:52:02 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: On Thu, 6 Mar 2025 14:30:35 GMT, Jatin Bhateja wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Added alignment to loop entries. > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 2: > >> 1: /* >> 2: * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved. > > Please update copyright year Thanks, fixed. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 96: > >> 94: StubRoutines::_dilithiumMontMulByConstant = generate_dilithiumMontMulByConstant_avx512(); >> 95: StubRoutines::_dilithiumDecomposePoly = generate_dilithiumDecomposePoly_avx512(); >> 96: } > > Indentation fix needed Thanks, fixed. > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 362: > >> 360: const Register roundsLeft = r11; >> 361: >> 362: __ align(OptoLoopAlignment); > > Redundant alignment before label should be before it's bind Thanks, fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991546308 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991546488 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991546606 From duke at openjdk.org Wed Mar 12 13:52:06 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 13:52:06 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v3] In-Reply-To: References: <0E2AqFpNPjDjP6jqCXn8toePBcW2SIHw1kFXlZX4W_U=.8d692bfa-0598-4969-b480-4a285366e0bb@github.com> Message-ID: <74tlAsyoYwN-fvtFyxp3xJYo76U68oF0ES4UVy7S_iY=.01f96647-395e-49bb-9e5a-f047b63460e0@github.com> On Thu, 6 Mar 2025 09:32:19 GMT, Jatin Bhateja wrote: >> I think the easiest is to put a for (int i = 0; i < 1000; i++) loop around the switch statement in the run() method of the ML_DSA_Test class (test/jdk/sun/security/provider/acvp/ML_DSA_Test.java). (This is because the intrinsics kick in after a few thousand calls of the method.) > > Hi @ferakocz , Yes, we should modify the test or lower the compilation threshold with -Xbatch -XX:TieredCompileThreshold=0.1. > > Alternatively, since the tests has a depedency on Automatic Cryptographic Validation Test server I have created a simplified test which cover all the security levels. > > Kindly include [test/hotspot/jtreg/compiler/intrinsics/signature/TestModuleLatticeDSA.java > ](https://github.com/ferakocz/jdk/pull/1) I have added a new command to the test test/jdk/sun/security/provider/acvp/Launcher.java. The line with the -Xcomp will invoke the intrinsics on the first call, so they will be tested. 
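For anyone who wants to exercise these code paths outside the ACVP setup, a plain JCA warm-up loop is usually enough. The sketch below is only an illustration (it is not part of this PR) and assumes the JDK 24+ standard algorithm names, i.e. "ML-DSA-65" for key pair generation and "ML-DSA" for signatures. Under default tiered compilation the hot ML-DSA helpers need a few thousand invocations before C2, and therefore the intrinsics, kick in, which is what the -Xcomp / lower-compile-threshold discussion above is about.

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.Signature;

    public class MLDSAWarmup {
        public static void main(String[] args) throws Exception {
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("ML-DSA-65");
            KeyPair kp = kpg.generateKeyPair();
            byte[] msg = "intrinsics warm-up".getBytes(StandardCharsets.UTF_8);

            Signature signer = Signature.getInstance("ML-DSA");
            Signature verifier = Signature.getInstance("ML-DSA");

            // Sign and verify repeatedly so the hot helpers behind ML-DSA become
            // warm enough to be JIT-compiled (and intrinsified where the CPU
            // supports it).
            for (int i = 0; i < 10_000; i++) {
                signer.initSign(kp.getPrivate());
                signer.update(msg);
                byte[] sig = signer.sign();

                verifier.initVerify(kp.getPublic());
                verifier.update(msg);
                if (!verifier.verify(sig)) {
                    throw new AssertionError("verification failed at iteration " + i);
                }
            }
            System.out.println("done");
        }
    }

Running the same program with -Xcomp (as in the new @run line) removes the warm-up requirement, since every method is compiled on first use.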
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991546056 From tschatzl at openjdk.org Wed Mar 12 14:00:15 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 12 Mar 2025 14:00:15 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> References: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> Message-ID: On Wed, 12 Mar 2025 12:23:50 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision: >> >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang >> - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. >> Cause are last-minute changes before making the PR ready to review. >> >> Testing: without the patch, occurs fairly frequently when continuously >> (1 in 20) starting refinement. Does not afterward. >> - * ayang review 3 >> * comments >> * minor refactorings >> - * iwalulya review >> * renaming >> * fix some includes, forward declaration >> - * fix whitespace >> * additional whitespace between log tags >> * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename >> - ayang review >> * renamings >> * refactorings >> - iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming >> - * ayang review - fix comment >> - * iwalulya review 2 >> * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState >> * some additional documentation >> - ... and 14 more: https://git.openjdk.org/jdk/compare/5727f166...aec95051 > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 263: > >> 261: >> 262: SuspendibleThreadSetLeaver sts_leave; >> 263: VMThread::execute(&op); > > Can you elaborate what synchronization this VM op is trying to achieve? Memory visibility for refinement threads for the references written to the heap. Without them, they may not have received the most recent values. This is the same as the `StoreLoad` barriers synchronization between mutator and refinement threads imo. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991561707 From lmesnik at openjdk.org Wed Mar 12 15:37:12 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Wed, 12 Mar 2025 15:37:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 13:42:33 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Added validity test for the intrinsics. test/jdk/sun/security/provider/acvp/Launcher.java line 43: > 41: * @modules java.base/sun.security.provider > 42: * @run main Launcher > 43: * @run main/othervm -Xcomp Launcher Thank you for adding this case. Please add it as a separate testcase: /* * @test * @summary Test verifies intrinsic implementation. * @library /test/lib * @modules java.base/sun.security.provider * @run main/othervm -Xcomp Launcher */ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1991769739 From vlivanov at openjdk.org Wed Mar 12 17:25:06 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 12 Mar 2025 17:25:06 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v2] In-Reply-To: <370pnPWKXnqHXz9pVOoU9vFfqdH8zIIV2K7BpqWRcEI=.0c63f38f-84ab-49c3-a0da-1ad9f1b22fb1@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> <370pnPWKXnqHXz9pVOoU9vFfqdH8zIIV2K7BpqWRcEI=.0c63f38f-84ab-49c3-a0da-1ad9f1b22fb1@github.com> Message-ID: On Wed, 12 Mar 2025 07:35:33 GMT, Aleksey Shipilev wrote: >> A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. >> >> Sample log excerpt for mainline: >> >> >> $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log >> 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap >> 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> >> >> You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `hotspot:tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Add to LogCompilation as well > - Merge branch 'master' into JDK-8351640-nmethod-not-entrant-reason > - Use resource allocation for temp buffer > - Base version Looks good. Do you mind incorporating log compilation tool support? 
[1] diff --git a/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/LogParser.java b/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/LogParser.java index e1e305abe10..61cbc054200 100644 --- a/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/LogParser.java +++ b/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/LogParser.java @@ -1099,6 +1099,10 @@ public void startElement(String uri, String localName, String qname, Attributes e.setCompileKind(compileKind); String level = atts.getValue("level"); e.setLevel(level); + String reason = atts.getValue("reason"); + if (reason != null) { + e.setReason(reason); + } events.add(e); } else if (qname.equals("uncommon_trap")) { String id = atts.getValue("compile_id"); diff --git a/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/MakeNotEntrantEvent.java b/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/MakeNotEntrantEvent.java index b4015537c74..d230f1b4336 100644 --- a/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/MakeNotEntrantEvent.java +++ b/src/utils/LogCompilation/src/main/java/com/sun/hotspot/tools/compiler/MakeNotEntrantEvent.java @@ -47,6 +47,11 @@ class MakeNotEntrantEvent extends BasicLogEvent { */ private String level; + /** + * The reason of invalidation. + */ + private String reason; + /** * The compile kind. */ @@ -64,10 +69,14 @@ public NMethod getNMethod() { public void print(PrintStream stream, boolean printID) { if (isZombie()) { - stream.printf("%s make_zombie\n", getId()); + stream.printf("%s make_zombie", getId()); } else { - stream.printf("%s make_not_entrant\n", getId()); + stream.printf("%s make_not_entrant", getId()); + } + if (getReason() != null) { + stream.printf(": %s", getReason()); } + stream.println(); } public boolean isZombie() { @@ -88,7 +97,21 @@ public void setLevel(String level) { this.level = level; } - /** + /** + * @return the reason + */ + public String getReason() { + return reason; + } + + /** + * @param reason the reason to set + */ + public void setReason(String reason) { + this.reason = reason; + } + + /** * @return the compileKind */ public String getCompileKind() { ------------- PR Review: https://git.openjdk.org/jdk/pull/23980#pullrequestreview-2679301582 From shade at openjdk.org Wed Mar 12 17:39:35 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 17:39:35 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v3] In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: > A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. 
> > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Add LogCompilation support ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23980/files - new: https://git.openjdk.org/jdk/pull/23980/files/38491fb2..5da9766d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23980&range=01-02 Stats: 30 lines in 2 files changed: 27 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23980.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23980/head:pull/23980 PR: https://git.openjdk.org/jdk/pull/23980 From shade at openjdk.org Wed Mar 12 17:39:35 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 17:39:35 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v2] In-Reply-To: References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> <370pnPWKXnqHXz9pVOoU9vFfqdH8zIIV2K7BpqWRcEI=.0c63f38f-84ab-49c3-a0da-1ad9f1b22fb1@github.com> Message-ID: On Wed, 12 Mar 2025 17:22:06 GMT, Vladimir Ivanov wrote: > Do you mind incorporating log compilation tool support? [1] I don't mind, added. Looks like this still works: $ cd src/tools/LogCompilation $ make ------------- PR Comment: https://git.openjdk.org/jdk/pull/23980#issuecomment-2718629919 From tschatzl at openjdk.org Wed Mar 12 17:44:01 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 12 Mar 2025 17:44:01 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> References: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> Message-ID: On Wed, 12 Mar 2025 13:20:25 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 24 additional commits since the last revision: >> >> - Merge branch 'master' into 8342382-card-table-instead-of-dcq >> - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang >> - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. >> Cause are last-minute changes before making the PR ready to review. >> >> Testing: without the patch, occurs fairly frequently when continuously >> (1 in 20) starting refinement. Does not afterward. >> - * ayang review 3 >> * comments >> * minor refactorings >> - * iwalulya review >> * renaming >> * fix some includes, forward declaration >> - * fix whitespace >> * additional whitespace between log tags >> * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename >> - ayang review >> * renamings >> * refactorings >> - iwalulya review >> * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement >> * predicate for determining whether the refinement has been disabled >> * some other typos/comment improvements >> * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming >> - * ayang review - fix comment >> - * iwalulya review 2 >> * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState >> * some additional documentation >> - ... and 14 more: https://git.openjdk.org/jdk/compare/0c7b5abb...aec95051 > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 217: > >> 215: >> 216: { >> 217: SuspendibleThreadSetLeaver sts_leave; > > Can you add some comment on why leaving the set is required? It's not obvious to me why. I'd expect handshake to work out of the box... It isn't apparently. Removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991999476 From tschatzl at openjdk.org Wed Mar 12 17:59:51 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 12 Mar 2025 17:59:51 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v18] In-Reply-To: References: Message-ID: <3KOwgdzYn_vXQVWisVUEY-0i1gtZEfZhcD1-id3epYE=.17aa84bc-a7ec-4dda-b596-7a1016d710fc@github.com> > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * ayang review * remove unnecessary STSleaver * some more documentation around to_collection_card card color ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/aec95051..3766b76c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=16-17 Stats: 18 lines in 2 files changed: 5 ins; 4 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From vlivanov at openjdk.org Wed Mar 12 18:17:03 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 12 Mar 2025 18:17:03 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v3] In-Reply-To: References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Wed, 12 Mar 2025 17:39:35 GMT, Aleksey Shipilev wrote: >> A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. 
>> >> Sample log excerpt for mainline: >> >> >> $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log >> 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap >> 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> >> >> You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `hotspot:tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Add LogCompilation support Thanks. Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23980#pullrequestreview-2679447944 From shade at openjdk.org Wed Mar 12 18:17:04 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 18:17:04 GMT Subject: RFR: 8351640: Print reason for making method not entrant [v3] In-Reply-To: References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Wed, 12 Mar 2025 17:39:35 GMT, Aleksey Shipilev wrote: >> A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. >> >> Sample log excerpt for mainline: >> >> >> $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log >> 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap >> 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) >> 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used >> >> >> You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. 
>> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `hotspot:tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Add LogCompilation support Thanks! I'll integrate once GHA clears. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23980#issuecomment-2718719599 From duke at openjdk.org Wed Mar 12 19:19:08 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 12 Mar 2025 19:19:08 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Made the intrinsics test separate from the pure java test. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/f65ef7c4..aa2fdf2d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=05-06 Stats: 8 lines in 1 file changed: 8 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From shade at openjdk.org Wed Mar 12 19:47:58 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Mar 2025 19:47:58 GMT Subject: Integrated: 8351640: Print reason for making method not entrant In-Reply-To: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> References: <_XHdskC5Q0n4cwspFV97uiUyS2HsWDSZAK-YkGGCUIA=.8e86e6e2-9c46-440f-aca5-efbc54475f29@github.com> Message-ID: On Tue, 11 Mar 2025 11:36:59 GMT, Aleksey Shipilev wrote: > A simple quality of life improvement. We are studying compiler dynamics in Leyden, and it would be convenient to know why the particular methods are marked as not entrant. We just need to pass the extra string argument to `nmethod::make_not_entrant` and print it out. > > Sample log excerpt for mainline: > > > $ grep com.sun.tools.javac.util.IntHashTable::lookup print-compilation.log > 987 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1019 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 1024 780 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > 4995 877 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: uncommon trap > 5287 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6615 5472 4 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) > 6626 3734 3 com.sun.tools.javac.util.IntHashTable::lookup (100 bytes) made not entrant: not used > > > You can now clearly see the method lifecycle. 1 second in app lifetime, the method was initially compiled at level 3. Shortly after, it got compiled at level 4, turning level 3 method unused. 4 seconds later, level 4 method encountered uncommon trap, so we are back to level 3. After 1.3 seconds more, the final compilation at level 4 completed, and second level 3 compilation was removed as unused. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot:tier1` > - [x] Linux x86_64 server fastdebug, `all` This pull request has now been integrated. 
Changeset: 930455b5 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/930455b59608b547017c9649efeb6bd381340c34 Stats: 68 lines in 16 files changed: 35 ins; 0 del; 33 mod 8351640: Print reason for making method not entrant Co-authored-by: Vladimir Ivanov Reviewed-by: vlivanov, kvn ------------- PR: https://git.openjdk.org/jdk/pull/23980 From tschatzl at openjdk.org Thu Mar 13 13:07:29 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 13 Mar 2025 13:07:29 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v19] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. 
* additional verification * added some missing ResourceMarks in asserts * added variant of ArrayJuggle2 that crashes fairly quickly without these changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/3766b76c..78611173 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=18 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=17-18 Stats: 111 lines in 11 files changed: 82 ins; 13 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From galder at openjdk.org Thu Mar 13 13:50:14 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 13 Mar 2025 13:50:14 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> Message-ID: On Fri, 7 Mar 2025 13:17:29 GMT, Emanuel Peter wrote: >>> As for possible solutions. In all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFE's: >>> >>> * Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue. >> >> I've created [JDK-8351409](https://bugs.openjdk.org/browse/JDK-8351409) to address this. > > @galderz Excellent. Testing looks all good on our side. Yes I think what you saw was unrelated. > @rwestrel Could give this a last quick scan and then I think you can integrate :) Thanks @eme64 @rwestrel @chhagedorn for your patience with this! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2721319344 From duke at openjdk.org Thu Mar 13 13:50:22 2025 From: duke at openjdk.org (duke) Date: Thu, 13 Mar 2025 13:50:22 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v14] In-Reply-To: <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <9c34YjVjK0BMclNqFWMSitBV2YTcu_jmgWVitjRgvF0=.0f225af6-5888-4160-9a54-09baa696da1c@github.com> Message-ID: On Fri, 7 Mar 2025 06:19:03 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. 
>> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 47 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Add assertion comments > - Add simple reduction benchmarks on top of multiply ones > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - ... and 37 more: https://git.openjdk.org/jdk/compare/c836c5b7...1aa690d3 @galderz Your change (at version 1aa690d391ef3536d422ba93c33d0fc273a911c6) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2721323015 From galder at openjdk.org Thu Mar 13 13:57:23 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 13 Mar 2025 13:57:23 GMT Subject: Integrated: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) In-Reply-To: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Tue, 9 Jul 2024 12:07:37 GMT, Galder Zamarre?o wrote: > This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. > > Currently vectorization does not kick in for loops containing either of these calls because of the following error: > > > VLoop::check_preconditions: failed: control flow in loop not allowed > > > The control flow is due to the java implementation for these methods, e.g. 
> > > public static long max(long a, long b) { > return (a >= b) ? a : b; > } > > > This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. > By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. > E.g. > > > SuperWord::transform_loop: > Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined > 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) > > > Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1155 > long max 1173 > > > After the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1042 > long max 1042 > > > This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. > Therefore, it still relies on the macro expansion to transform those into CMoveL. > > I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier1 2500 2500 0 0 >>> jtreg:test/jdk:tier1 ... This pull request has now been integrated. Changeset: 4e51a8c9 Author: Galder Zamarre?o URL: https://git.openjdk.org/jdk/commit/4e51a8c9ad4e5345d05cf32ce1e82b7158f80e93 Stats: 844 lines in 9 files changed: 725 ins; 107 del; 12 mod 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) Reviewed-by: roland, epeter, chagedorn, darcy ------------- PR: https://git.openjdk.org/jdk/pull/20098 From tschatzl at openjdk.org Thu Mar 13 14:16:07 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 13 Mar 2025 14:16:07 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v19] In-Reply-To: References: Message-ID: <-ys7CbBNU4hCmEgYQyZpmBQ_rso4i2_KoFHLPNv73sI=.bd715b1d-b9fd-48b7-bb06-d6673ab2dbfc@github.com> On Thu, 13 Mar 2025 13:07:29 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. 
Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. > * additional verification > * added some missing ResourceMarks in asserts > * added variant of ArrayJuggle2 that crashes fairly quickly without these changes Commit https://github.com/openjdk/jdk/pull/23739/commits/786111735c306583af5bc75f7653f0da67d52adb fixes an issue with full gc interrupting refinement while the global card table and the JavaThread's card table changes. Testing: tier1-7 with changes, tier1-5 with changes stressing refinement similar to the ones added to the new test. The new variant of `ArrayJuggle2` fails >50% of all times in our CI without the patch (verified 700 or so executions of that not failing with patch). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2721413659 From tschatzl at openjdk.org Fri Mar 14 14:28:57 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 14 Mar 2025 14:28:57 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17] In-Reply-To: References: <0w7seS1tIFhUxnmStxQySISWVfpBBsRmUtx7EoTy9a4=.509a3d5e-56d0-4fd8-8896-51835b14302b@github.com> Message-ID: <58jXaIS3TNN9Y9xWGSKWM7B4C0dbZ6YxRWjPMmBeFnY=.506b75a0-12a4-424c-869c-8358195947d9@github.com> On Wed, 12 Mar 2025 13:56:57 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 263: >> >>> 261: >>> 262: SuspendibleThreadSetLeaver sts_leave; >>> 263: VMThread::execute(&op); >> >> Can you elaborate what synchronization this VM op is trying to achieve? 
> > Memory visibility for refinement threads for the references written to the heap. Without them, they may not have received the most recent values. > This is the same as the `StoreLoad` barriers synchronization between mutator and refinement threads imo. There has been a discussion about whether this is actually needed. Initially we thought that this could be removed because it's only the refinement worker threads that would need memory synchronization, and the memory synchronization is handled by just starting up the refinement threads. However the rebuild remsets process (marking threads) also access the global card table reference to mark the to-collection-set cards and its value must be synchronized. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1995683088 From tschatzl at openjdk.org Fri Mar 14 14:37:27 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 14 Mar 2025 14:37:27 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v20] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). 
> > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * ayang review * re-add STS leaver for java thread handshake ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/78611173..51a9eed8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=19 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=18-19 Stats: 15 lines in 1 file changed: 5 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Fri Mar 14 16:35:38 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 14 Mar 2025 16:35:38 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v21] In-Reply-To: References: Message-ID: <1bH6bLmIYx6eVtZ4IPlFtdYpdCAwSaNB6w0uNljTSJE=.8a4a88c7-2f66-493c-91dd-6fc6c744c08f@github.com> > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. 
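For contrast, the reduced barrier this change aims for ends the inline path at the card write on the mutator's (primary) card table: no StoreLoad fence and no enqueueing into a dirty card queue. Below is a minimal conceptual sketch in Java; the names, the card size, and the dirty marker value are illustrative assumptions, not the actual HotSpot code, which emits the barrier as a handful of compiled instructions and may retain some of the filtering checks quoted above.

```
// Conceptual model of the reduced post-write barrier for a reference store x.a = y.
// Everything here (class, field and constant names, card size) is assumed for illustration.
final class ReducedPostBarrierSketch {
    static final int CARD_SHIFT = 9;                      // assume 512 bytes of heap per card
    static final byte DIRTY = 0;                          // assume this marks a dirty card
    static byte[] primaryCardTable = new byte[1 << 20];   // the only table the mutator touches

    static void postWriteBarrier(long storeAddress) {
        int cardIndex = (int) (storeAddress >>> CARD_SHIFT);
        // A single card-table store: no StoreLoad synchronization, no dirty card queue.
        primaryCardTable[cardIndex] = DIRTY;
    }
}
```

Refinement then operates on the separate refinement table after the tables are switched atomically, which is what removes the need for the fine-grained mutator/refinement synchronization of the old barrier.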
> > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 28 commits: - Merge branch 'master' into 8342381-card-table-instead-of-dcq - * ayang review * re-add STS leaver for java thread handshake - * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. * additional verification * added some missing ResourceMarks in asserts * added variant of ArrayJuggle2 that crashes fairly quickly without these changes - * ayang review * remove unnecessary STSleaver * some more documentation around to_collection_card card color - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. Cause are last-minute changes before making the PR ready to review. Testing: without the patch, occurs fairly frequently when continuously (1 in 20) starting refinement. Does not afterward. - * ayang review 3 * comments * minor refactorings - * iwalulya review * renaming * fix some includes, forward declaration - * fix whitespace * additional whitespace between log tags * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename - ... and 18 more: https://git.openjdk.org/jdk/compare/7f428041...b0730176 ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=20 Stats: 6761 lines in 99 files changed: 2368 ins; 3464 del; 929 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Sat Mar 15 13:12:39 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Sat, 15 Mar 2025 13:12:39 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v22] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
> > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * more documentation on why we need to rendezvous the gc threads ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/b0730176..447fe39b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=20-21 Stats: 7 lines in 1 file changed: 6 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Mon Mar 17 10:32:33 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 17 Mar 2025 10:32:33 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v23] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
>
> The main reason for the current barrier is how g1 implements concurrent refinement:
> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads,
> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
>
> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
>
>
> // Filtering
> if (region(@x.a) == region(y)) goto done; // same region check
> if (y == null) goto done; // null value check
> if (card(@x.a) == young_card) goto done; // write to young gen check
> StoreLoad; // synchronize
> if (card(@x.a) == dirty_card) goto done;
>
> *card(@x.a) = dirty
>
> // Card tracking
> enqueue(card-address(@x.a)) into thread-local-dcq;
> if (thread-local-dcq is not full) goto done;
>
> call runtime to move thread-local-dcq into dcqs
>
> done:
>
>
> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc.
>
> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>
> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>
> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se...

Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision:

* obsolete G1UpdateBufferSize

G1UpdateBufferSize has previously been used to size the refinement buffers and to impose a minimum limit on the number of cards per thread that need to be pending before refinement starts. The former function is now obsolete with the removal of the dirty card queues, and the latter functionality has been taken over by the new diagnostic option `G1PerThreadPendingCardThreshold`. I prefer to make this a diagnostic option rather than a product option because it is only needed by some test cases to produce otherwise unwanted behavior (continuous refinement). CSR is pending.
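Since `G1PerThreadPendingCardThreshold` is a diagnostic option, a test or command line that wants to exercise it would presumably need the usual diagnostic unlocking, along these lines (both angle-bracketed values are placeholders, not recommendations):

```
java -XX:+UnlockDiagnosticVMOptions -XX:G1PerThreadPendingCardThreshold=<cards> <application>
```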
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/447fe39b..4d0afd57 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=22 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=21-22 Stats: 16 lines in 7 files changed: 2 ins; 9 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Mon Mar 17 11:38:04 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 17 Mar 2025 11:38:04 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 15:34:18 GMT, Leonid Mesnik wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Added validity test for the intrinsics. > > test/jdk/sun/security/provider/acvp/Launcher.java line 43: > >> 41: * @modules java.base/sun.security.provider >> 42: * @run main Launcher >> 43: * @run main/othervm -Xcomp Launcher > > Thank you for adding this case. Please add it as a separate testcase: > /* > * @test > * @summary Test verifies intrinsic implementation. > * @library /test/lib > * @modules java.base/sun.security.provider > * @run main/othervm -Xcomp Launcher > */ Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1998545085 From lmesnik at openjdk.org Mon Mar 17 16:10:25 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Mon, 17 Mar 2025 16:10:25 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 19:19:08 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Made the intrinsics test separate from the pure java test. Test changes looks good. ------------- Marked as reviewed by lmesnik (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23860#pullrequestreview-2691165965 From vpaprotski at openjdk.org Mon Mar 17 21:49:12 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 17 Mar 2025 21:49:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 19:19:08 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Made the intrinsics test separate from the pure java test. Partial review, just didnt want to sit on comments for this long. (Spent quite a bit of time catching up on papers and math required) The biggest roadblock I have following the code are raw register numbers. (And more comments? perhaps I need more math knowledge, but comments would help too). Also, 'hidden variables' (xmm30). Can't complain, because this is exactly what Vladimir Ivanov told me to do on my first PR https://github.com/openjdk/jdk/pull/10582#discussion_r1022185591 Perhaps that discussion applies here too. 
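As background for the dilithium stub comments that follow: `montmulEven` vectorizes a Montgomery multiplication modulo the ML-DSA prime q = 8380417 with R = 2^32. The scalar sketch below is pieced together from the Java lines quoted in the annotations; the constant names and the value used for q^-1 mod 2^32 follow the public Dilithium reference parameters and are assumptions about the JDK sources rather than a copy of them.

```
// Scalar sketch of Montgomery multiplication mod q (q = 8380417, R = 2^32).
// Returns a value congruent to b * c * R^-1 (mod q); for inputs already reduced
// mod q the result lies strictly between -q and q.
final class MontMulSketch {
    static final int MONT_Q = 8380417;             // ML-DSA / Dilithium modulus q
    static final int MONT_Q_INV_MOD_R = 58728449;  // q^-1 mod 2^32 (reference-implementation value)
    static final int MONT_R_BITS = 32;

    static int montMul(int b, int c) {
        long a = (long) b * (long) c;              // full 64-bit product
        int aHigh = (int) (a >> MONT_R_BITS);      // the "odd" 32-bit half
        int aLow = (int) a;                        // the "even" 32-bit half
        int m = MONT_Q_INV_MOD_R * aLow;           // signed low product, wraps mod 2^32
        // a - (long) m * q is divisible by 2^32, so the reduction collapses to a
        // subtraction of the two high halves.
        return aHigh - (int) (((long) m * MONT_Q) >> MONT_R_BITS);
    }
}
```

This is also why, in the vectorized form, vpmuldq consumes the even 32-bit lanes and leaves results in the odd columns, which the vpshufd and evpermt2d steps discussed in the annotations then shuffle back into place.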
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 45: > 43: // Constants > 44: // > 45: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Consts[] = { This is really nitpicking.. but could had loaded constants inline with `movl` without requiring an ExternalAddress()? Nice to have constants together, only complaint is we have 'magic offsets' in ASM to reach in for particular one.. This one isnt too bad, offset of 32bits is easy to inspect visually (`dilithiumAvx512ConstsAddr()` could take a parameter perhaps) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 58: > 56: > 57: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Perms[] = { > 58: // collect montmul results into the destination register same as `dilithiumAvx512Consts()`, 'magic offsets'; except here they are harder to count (eg. not clear visually what is the offset of `ntt inverse`). Could be split into three constant arrays to make the compiler count for us src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 127: > 125: for (int i = 0; i < parCnt; i++) { > 126: __ evpsubd(xmm(i + outputReg), k0, xmm(i + scratchReg1), xmm(i + scratchReg2), false, Assembler::AVX_512bit); > 127: } This is such a deceptively brilliant function!!! Took me a while to understand (and map to Java `montMul` function). Perhaps needs more comments. The comment on line 99 does provide good hints, but I still had some trouble. I ended up annotating a copy quite a bit. I do think all 'clever code' needs comments. Here is my annotated version, if you want to copy out anything: static void montmulEven2(XMMRegister outputReg, XMMRegister inputReg1, XMMRegister inputReg2, XMMRegister scratchReg1, XMMRegister scratchReg2, XMMRegister montQInvModR, XMMRegister dilithium_q, int parCnt, MacroAssembler* _masm) { int output = outputReg->encoding(); int input1 = inputReg1->encoding(); int input2 = inputReg2->encoding(); int scratch1 = scratchReg1->encoding(); int scratch2 = scratchReg2->encoding(); for (int i = 0; i < parCnt; i++) { // scratch1 = (int64)input1_even*input2_even // Java: long a = (long) b * (long) c; __ vpmuldq(xmm(i + scratch1), xmm(i + input1), xmm((input2 == 29) ? 29 : input2 + i), Assembler::AVX_512bit); } for (int i = 0; i < parCnt; i++) { // scratch2 = int32(montQInvModR*(int32)scratch1) // Java: int aLow = (int) a; // Java: int m = MONT_Q_INV_MOD_R * aLow; // signed low product __ vpmulld(xmm(i + scratch2), xmm(i + scratch1), montQInvModR, Assembler::AVX_512bit); } for (int i = 0; i < parCnt; i++) { // scratch2 = (int64)scratch2_even*dilithium_q_even // Java: ((long)m * MONT_Q) __ vpmuldq(xmm(i + scratch2), xmm(i + scratch2), dilithium_q, Assembler::AVX_512bit); } for (int i = 0; i < parCnt; i++) { // output_odd = scratch1_odd - scratch2_odd // Java: (aHigh - (int) (("scratch2") >> MONT_R_BITS)) __ evpsubd(xmm(i + output), k0, xmm(i + scratch1), xmm(i + scratch2), false, Assembler::AVX_512bit); } } - add comment that input2 can be xmm29, treated as constants, not consecutive (i.e. zetas) - Candidate for ascii art, even/odd columns, implicit int/long casts (or more 'math' comments on what happens) - use XMMRegisters instead of numbers (improve callsite readability) - can use either `inputReg1 = inputReg1->successor()` - or get `encoding()` and keep current style - could be static (local) function (hide from header), then pass _masm - pass all registers used (helps seeing register allocation, confirm no overlaps) False trails (i.e. 
nothing to do, but I thought about it already, so other reviewer doesnt have to?) - (ignore: worse performance) squash into a single for loop, let cpu do out-of-order (and improve readability) - xmm30/xmm31 (montQInvModR/dilithium_q) are constant. At a glance, it looks like they should be combined into one precomputed one. And paper 039.pdf suggests merging constants precompute the product; but.. different constants and looking at Java, there are several implicit casts For reductions of products inside the NTT this is not a problem because one has to multiply by the roots of unity which are compile-time constants. So one can just precompute them with an additional factor of ? mod q so that the results after Montgomery reduction are in fact congruent to the desired value a src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 140: > 138: __ vpmuldq(xmm(scratchReg1 + 1), xmm(inputReg12), xmm(inputReg2 + 1), Assembler::AVX_512bit); > 139: __ vpmuldq(xmm(scratchReg1 + 2), xmm(inputReg13), xmm(inputReg2 + 2), Assembler::AVX_512bit); > 140: __ vpmuldq(xmm(scratchReg1 + 3), xmm(inputReg14), xmm(inputReg2 + 3), Assembler::AVX_512bit); Another option for these four lines, to keep the style of rest of function int inputReg1[] = {inputReg11, inputReg12, inputReg13, inputReg14}; for (int i = 0; i < parCnt; i++) { __ vpmuldq(xmm(scratchReg1 + i), inputReg1[i], xmm(inputReg2 + i), Assembler::AVX_512bit); } src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 197: > 195: > 196: // level 0 > 197: montmulEven(20, 8, 29, 20, 16, 4); It would improve readability to know which parameter is a register, and which is a count.. i.e. `montmulEven(xmm20, xmm8, xmm29, xmm20, xmm16, 4);` (its not _that_ bad, once I remember that its always the last parameter.. but it does add to the 'mental load' one has to carry, and this code is already interesting enough) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 980: > 978: // Dilithium multiply polynomials in the NTT domain. > 979: // Implements > 980: // static int implDilithiumNttMult( I suppose no java changes in this PR, but I notice that the inputs are all assumed to have fixed size. Most/all intrinsics I worked with had some sort of guard (eg `Objects.checkFromIndexSize`) right before the intrinsic java call. (It usually looks like it can be optimized away). But I notice no such guard here on the java side. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1010: > 1008: __ vpbroadcastd(xmm31, Address(dilithiumConsts, 4), Assembler::AVX_512bit); // q > 1009: __ vpbroadcastd(xmm29, Address(dilithiumConsts, 12), Assembler::AVX_512bit); // 2^64 mod q > 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); - use of `c_rarg3` is 'clever' so probably should have a comment (ie. 'no 3rd parameter, free register') - Alternatively, load directly into the vector with `ExternalAddress()`; you need a scratch register (use r10) but address is close enough, it actually wont be used. 
Here is the disassembly I got: StubRoutines::dilithiumNttMult [0x00007f414fb68280, 0x00007f414fb68548] (712 bytes) -------------------------------------------------------------------------------- add %al,(%rax) 0x00007f414fb68280: push %rbp 0x00007f414fb68281: mov %rsp,%rbp 0x00007f414fb68284: vpbroadcastd 0x18f9fe32(%rip),%zmm30 # 0x00007f4168b080c0 0x00007f414fb6828e: vpbroadcastd 0x18f9fe2c(%rip),%zmm31 # 0x00007f4168b080c4 0x00007f414fb68298: vpbroadcastd 0x18f9fe2a(%rip),%zmm29 # 0x00007f4168b080cc 0x00007f414fb682a2: vmovdqu32 0x18f9f8d4(%rip),%zmm28 # 0x00007f4168b07b80 ``` The `ExternalAddress()` calls for above assembler ``` const Register scratch = r10; const XMMRegister montRSquareModQ = xmm29; const XMMRegister montQInvModR = xmm30; const XMMRegister dilithium_q = xmm31; const XMMRegister perms = xmm28; __ vpbroadcastd(montQInvModR, ExternalAddress(dilithiumAvx512ConstsAddr()), Assembler::AVX_512bit, scratch); // q^-1 mod 2^32 __ vpbroadcastd(dilithium_q, ExternalAddress(dilithiumAvx512ConstsAddr() + 4), Assembler::AVX_512bit, scratch); // q __ vpbroadcastd(montRSquareModQ, ExternalAddress(dilithiumAvx512ConstsAddr() + 12), Assembler::AVX_512bit, scratch); // 2^64 mod q __ evmovdqul(perms, k0, ExternalAddress(dilithiumAvx512PermsAddr()), false, Assembler::AVX_512bit, scratch); (and `dilithiumAvx512ConstsAddr(offset)` cound take an int parameter too) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1012: > 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); > 1011: > 1012: __ movl(len, 4); Compile-time constant, why not 'unroll at compile time'? i.e. wrap this loop with `for (int len=0; len<4; len++)` instead? src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1041: > 1039: for (int i = 0; i < 4; i++) { > 1040: __ evmovdqul(Address(result, i * 64), xmm(i), Assembler::AVX_512bit); > 1041: } This is nice, compact and clean. The biggest issue I have with following this code is really with all the 'raw' registers. I would much rather prefer symbolic names, but up to you to decide style. I ended up 'annotating' this snippet, so I could understand it and confirm everything.. as with montmulEven, hope some of it can be useful to you to copy out. 
XMMRegister POLY1[] = {xmm0, xmm1, xmm2, xmm3}; XMMRegister POLY2[] = {xmm4, xmm5, xmm6, xmm7}; XMMRegister SCRATCH1[] = {xmm12, xmm13, xmm14, xmm15}; XMMRegister SCRATCH2[] = {xmm16, xmm17, xmm18, xmm19}; XMMRegister SCRATCH3[] = {xmm8, xmm9, xmm10, xmm11}; for (int i = 0; i < 4; i++) { __ evmovdqul(POLY1[i], Address(poly1, i * 64), Assembler::AVX_512bit); __ evmovdqul(POLY2[i], Address(poly2, i * 64), Assembler::AVX_512bit); } // montmulEven: inputs are in even columns and output is in odd columns // scratch3_even = poly2_even*montRSquareModQ // poly2 to montgomery domain montmulEven2(SCRATCH3[0], POLY2[0], montRSquareModQ, SCRATCH1[0], SCRATCH2[0], montQInvModR, dilithium_q, 4, _masm); for (int i = 0; i < 4; i++) { // swap even/odd; 0xB1 == 2-3-0-1 __ vpshufd(SCRATCH3[i], SCRATCH3[i], 0xB1, Assembler::AVX_512bit); } // scratch3_odd = poly1_even*scratch3_even = poly1_even*poly2_even*montRSquareModQ montmulEven2(SCRATCH3[0], POLY1[0], SCRATCH3[0], SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); for (int i = 0; i < 4; i++) { __ vpshufd(POLY1[i], POLY1[i], 0xB1, Assembler::AVX_512bit); __ vpshufd(POLY2[i], POLY2[i], 0xB1, Assembler::AVX_512bit); } // poly2_even = poly2_odd*montRSquareModQ // poly2 to montgomery domain montmulEven2(POLY2[0], POLY2[0], montRSquareModQ, SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); for (int i = 0; i < 4; i++) { __ vpshufd(POLY2[i], POLY2[i], 0xB1, Assembler::AVX_512bit); } // poly1_odd = poly1_even*poly2_even montmulEven2(POLY1[0], POLY1[0], POLY2[0], SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); for (int i = 0; i < 4; i++) { // result is scrambled between scratch3_odd and poly1_odd; unscramble __ evpermt2d(POLY1[i], perms, SCRATCH3[i], Assembler::AVX_512bit); } for (int i = 0; i < 4; i++) { __ evmovdqul(Address(result, i * 64), POLY1[i], Assembler::AVX_512bit); } With symbolic variable names, code was much easier to follow conceptually. 
Also has the side benefit of making it obvious which XMM registers are used and that there is no conflicts src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1090: > 1088: __ evpbroadcastd(xmm29, constant, Assembler::AVX_512bit); // constant multiplier > 1089: > 1090: __ movl(len, 2); Same comment here as the `generate_dilithiumNttMult_avx512` - constants can be loaded directly into XMM - len can be removed by unrolling at compile time - symbolic names could be used for registers - comments could be added ------------- PR Review: https://git.openjdk.org/jdk/pull/23860#pullrequestreview-2665370975 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999468929 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999471763 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999625933 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1992230295 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1992235625 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999712200 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999413007 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999367607 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999683384 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1999686631 From vpaprotski at openjdk.org Mon Mar 17 21:49:14 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 17 Mar 2025 21:49:14 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Thu, 6 Mar 2025 17:37:33 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Accepted review comments. src/hotspot/cpu/x86/stubGenerator_x86_64.hpp line 494: > 492: address generate_sha3_implCompress(StubGenStubId stub_id); > 493: > 494: address generate_double_keccak(); you can hide internal helper functions (i.e. `montmulEven(*)`) if you wish. The trick is to add `MacroAssembler* _masm` as a parameter to the static (local) function. Its a trick I use to keep header clean, but still have plenty of helpers src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 409: > 407: __ evmovdquq(xmm29, Address(permsAndRots, 768), Assembler::AVX_512bit); > 408: __ evmovdquq(xmm30, Address(permsAndRots, 832), Assembler::AVX_512bit); > 409: __ evmovdquq(xmm31, Address(permsAndRots, 896), Assembler::AVX_512bit); Matter of taste, but I liked the compactness of montmulEven; i.e. for (i=0; i<15; i++) __ evmovdquq(xmm(17+i), Address(permsAndRots, 64*i), Assembler::AVX_512bit); src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 426: > 424: __ subl( roundsLeft, 1); > 425: > 426: __ evmovdquw(xmm5, xmm0, Assembler::AVX_512bit); Is there a pattern here; that can be 'compacted' into a loop? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983903347 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983935964 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r1983937154 From tschatzl at openjdk.org Tue Mar 18 16:24:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 18 Mar 2025 16:24:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v24] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 32 commits: - * factor out card table and refinement table merging into a single method - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 - * obsolete G1UpdateBufferSize G1UpdateBufferSize has previously been used to size the refinement buffers and impose a minimum limit on the number of cards per thread that need to be pending before refinement starts. The former function is now obsolete with the removal of the dirty card queues, the latter functionality has been taken over by the new diagnostic option `G1PerThreadPendingCardThreshold`. I prefer to make this a diagnostic option is better than a product option because it is something that is only necessary for some test cases to produce some otherwise unwanted behavior (continuous refinement). CSR is pending. - * more documentation on why we need to rendezvous the gc threads - Merge branch 'master' into 8342381-card-table-instead-of-dcq - * ayang review * re-add STS leaver for java thread handshake - * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. * additional verification * added some missing ResourceMarks in asserts * added variant of ArrayJuggle2 that crashes fairly quickly without these changes - * ayang review * remove unnecessary STSleaver * some more documentation around to_collection_card card color - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang - ... and 22 more: https://git.openjdk.org/jdk/compare/b025d8c2...c833bc83 ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=23 Stats: 6788 lines in 104 files changed: 2382 ins; 3476 del; 930 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Wed Mar 19 13:17:19 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 19 Mar 2025 13:17:19 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v25] In-Reply-To: References: Message-ID: <5Q9-MERAD4KIP-fzgw7JVAtC9u4L1fEFGcNkdHBvkg4=.1917bd58-a5f8-4c5c-b1f9-27b7457c6262@github.com> > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * fix IR code generation tests that change due to barrier cost changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/c833bc83..f419556e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=24 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=23-24 Stats: 5 lines in 2 files changed: 2 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Wed Mar 19 13:27:17 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 19 Mar 2025 13:27:17 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v25] In-Reply-To: <5Q9-MERAD4KIP-fzgw7JVAtC9u4L1fEFGcNkdHBvkg4=.1917bd58-a5f8-4c5c-b1f9-27b7457c6262@github.com> References: <5Q9-MERAD4KIP-fzgw7JVAtC9u4L1fEFGcNkdHBvkg4=.1917bd58-a5f8-4c5c-b1f9-27b7457c6262@github.com> Message-ID: On Wed, 19 Mar 2025 13:17:19 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
>> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * fix IR code generation tests that change due to barrier cost changes Commit https://github.com/openjdk/jdk/pull/23739/commits/f419556e9177ecf9fbf22e606dd6c1b850f4330f fixes the failing compiler tests that check whether the compiler emits the correct object graph. Occurs after merging with mainline that significantly reduces total barrier cost calculation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2736639357 From tschatzl at openjdk.org Thu Mar 20 09:44:07 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 20 Mar 2025 09:44:07 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v26] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. 
> > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * make young gen length revising independent of refinement thread * use a service task * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/f419556e..5e76a516 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=25 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=24-25 Stats: 337 lines in 12 files changed: 237 ins; 90 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Thu Mar 20 09:49:13 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 20 Mar 2025 09:49:13 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v26] In-Reply-To: References: Message-ID: On Thu, 20 Mar 2025 09:44:07 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. 
>> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * make young gen length revising independent of refinement thread > * use a service task > * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update Commit https://github.com/openjdk/jdk/pull/23739/commits/5e76a516c848e75f56e966a1ffe4115b1dce786c implements the change to make young gen length revising independent of the refinement control thread. Infrastructure to determine currently available number of bytes for allocation and determining the next time the particular task should be redone is shared. It may be distributed across a bit more methods than I would prefer, but particularly the refinement control thread wants to reuse and keep some intermediate results (to not be required to get the `Heap_lock` again basically). I did not have a good reason to make the heuristic to determine the time to the next action different for both, so they are basically the same. There is some pre-existing problem that the minimum time for re-doing the work is ~50ms. That might be too short in some cases, but then again, if you have that short of a GC interval it may not be very useful to e.g. revise young gen length anyway. 
I think with this change all current concerns are addressed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2739766880 From duke at openjdk.org Thu Mar 20 11:29:57 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 11:29:57 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: responding to review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/aa2fdf2d..2438fb5c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=06-07 Stats: 750 lines in 3 files changed: 174 ins; 447 del; 129 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From thartmann at openjdk.org Thu Mar 20 12:29:22 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 20 Mar 2025 12:29:22 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> Message-ID: On Fri, 7 Mar 2025 18:03:14 GMT, Vladimir Ivanov wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 
252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > src/hotspot/share/opto/library_call.cpp line 1963: > >> 1961: set_i_o(i_o()); >> 1962: >> 1963: uncommon_trap(Deoptimization::Reason_intrinsic, > > What about using `builtin_throw` here? (Requires some tuning on `builtin_throw` side.) How much does it affect performance? Also, passing `must_throw = true` into `uncommon_trap` may help a bit here as well. I think adapting and re-using `builtin_throw` like you described is reasonable but I let @iwanowww confirm :slightly_smiling_face: ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2005526386 From epeter at openjdk.org Thu Mar 20 13:54:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Mar 2025 13:54:07 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> Message-ID: <5rSvBeQxKuX-hhaLGygKRBi_VpALqwywgnKfK61a8j4=.258cf9ca-56fe-42a9-85b1-b6aa30f2eb5c@github.com> On Fri, 7 Mar 2025 18:03:53 GMT, Vladimir Ivanov wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 
0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Nice benchmark, Marc! @iwanowww Are you still reviewing or should I have a look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2740528216 From duke at openjdk.org Thu Mar 20 18:42:48 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 18:42:48 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v9] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: More beautification ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/2438fb5c..1cfab778 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=07-08 Stats: 307 lines in 1 file changed: 49 ins; 131 del; 127 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Thu Mar 20 20:37:25 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 20:37:25 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v10] In-Reply-To: References: Message-ID: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Fix windows build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/1cfab778..e9db09e2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=08-09 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From duke at openjdk.org Thu Mar 20 21:09:14 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 21:09:14 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Thu, 6 Mar 2025 19:27:12 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Accepted review comments. > > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 426: > >> 424: __ subl( roundsLeft, 1); >> 425: >> 426: __ evmovdquw(xmm5, xmm0, Assembler::AVX_512bit); > > Is there a pattern here; that can be 'compacted' into a loop? Unfortunately, no. This loop body is imported from generate_sha3_implCompress() and doubled, as explained in the comment about 15 lines above. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455877 From duke at openjdk.org Thu Mar 20 21:09:12 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Mar 2025 21:09:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Mon, 17 Mar 2025 19:24:52 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Made the intrinsics test separate from the pure java test. > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 58: > >> 56: >> 57: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Perms[] = { >> 58: // collect montmul results into the destination register > > same as `dilithiumAvx512Consts()`, 'magic offsets'; except here they are harder to count (eg. not clear visually what is the offset of `ntt inverse`). > > Could be split into three constant arrays to make the compiler count for us Well, it is 64 bytes per line (16 4-byte uint32_ts), not that hard :-) ... > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 140: > >> 138: __ vpmuldq(xmm(scratchReg1 + 1), xmm(inputReg12), xmm(inputReg2 + 1), Assembler::AVX_512bit); >> 139: __ vpmuldq(xmm(scratchReg1 + 2), xmm(inputReg13), xmm(inputReg2 + 2), Assembler::AVX_512bit); >> 140: __ vpmuldq(xmm(scratchReg1 + 3), xmm(inputReg14), xmm(inputReg2 + 3), Assembler::AVX_512bit); > > Another option for these four lines, to keep the style of rest of function > > int inputReg1[] = {inputReg11, inputReg12, inputReg13, inputReg14}; > for (int i = 0; i < parCnt; i++) { > __ vpmuldq(xmm(scratchReg1 + i), inputReg1[i], xmm(inputReg2 + i), Assembler::AVX_512bit); > } I have changed the whole structure instead. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 197: > >> 195: >> 196: // level 0 >> 197: montmulEven(20, 8, 29, 20, 16, 4); > > It would improve readability to know which parameter is a register, and which is a count.. i.e. > > `montmulEven(xmm20, xmm8, xmm29, xmm20, xmm16, 4);` > > (its not _that_ bad, once I remember that its always the last parameter.. but it does add to the 'mental load' one has to carry, and this code is already interesting enough) I have changed the structure, now it is clear(er) which parameter is what. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 980: > >> 978: // Dilithium multiply polynomials in the NTT domain. >> 979: // Implements >> 980: // static int implDilithiumNttMult( > > I suppose no java changes in this PR, but I notice that the inputs are all assumed to have fixed size. > > Most/all intrinsics I worked with had some sort of guard (eg `Objects.checkFromIndexSize`) right before the intrinsic java call. (It usually looks like it can be optimized away). But I notice no such guard here on the java side. These functions will not be used anywhere else and in ML_DSA.java all of the arrays passed to inrinsics are of the correct size. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1010: > >> 1008: __ vpbroadcastd(xmm31, Address(dilithiumConsts, 4), Assembler::AVX_512bit); // q >> 1009: __ vpbroadcastd(xmm29, Address(dilithiumConsts, 12), Assembler::AVX_512bit); // 2^64 mod q >> 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); > > - use of `c_rarg3` is 'clever' so probably should have a comment (ie. 
'no 3rd parameter, free register') > - Alternatively, load directly into the vector with `ExternalAddress()`; you need a scratch register (use r10) but address is close enough, it actually wont be used. Here is the disassembly I got: > > StubRoutines::dilithiumNttMult [0x00007f414fb68280, 0x00007f414fb68548] (712 bytes) > -------------------------------------------------------------------------------- > add %al,(%rax) > 0x00007f414fb68280: push %rbp > 0x00007f414fb68281: mov %rsp,%rbp > 0x00007f414fb68284: vpbroadcastd 0x18f9fe32(%rip),%zmm30 # 0x00007f4168b080c0 > 0x00007f414fb6828e: vpbroadcastd 0x18f9fe2c(%rip),%zmm31 # 0x00007f4168b080c4 > 0x00007f414fb68298: vpbroadcastd 0x18f9fe2a(%rip),%zmm29 # 0x00007f4168b080cc > 0x00007f414fb682a2: vmovdqu32 0x18f9f8d4(%rip),%zmm28 # 0x00007f4168b07b80 > ``` > > The `ExternalAddress()` calls for above assembler > ``` > const Register scratch = r10; > const XMMRegister montRSquareModQ = xmm29; > const XMMRegister montQInvModR = xmm30; > const XMMRegister dilithium_q = xmm31; > const XMMRegister perms = xmm28; > > __ vpbroadcastd(montQInvModR, ExternalAddress(dilithiumAvx512ConstsAddr()), Assembler::AVX_512bit, scratch); // q^-1 mod 2^32 > __ vpbroadcastd(dilithium_q, ExternalAddress(dilithiumAvx512ConstsAddr() + 4), Assembler::AVX_512bit, scratch); // q > __ vpbroadcastd(montRSquareModQ, ExternalAddress(dilithiumAvx512ConstsAddr() + 12), Assembler::AVX_512bit, scratch); // 2^64 mod q > __ evmovdqul(perms, k0, ExternalAddress(dilithiumAvx512PermsAddr()), false, Assembler::AVX_512bit, scratch); > > (and `dilithiumAvx512ConstsAddr(offset)` cound take an int parameter too) I added comments and changed the vpbroadcast loads to load directly from memory.l > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1012: > >> 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); >> 1011: >> 1012: __ movl(len, 4); > > Compile-time constant, why not 'unroll at compile time'? i.e. wrap this loop with `for (int len=0; len<4; len++)` instead? I have found that unrolling these loops actually hurts performance (probably an I-cache effect. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1041: > >> 1039: for (int i = 0; i < 4; i++) { >> 1040: __ evmovdqul(Address(result, i * 64), xmm(i), Assembler::AVX_512bit); >> 1041: } > > This is nice, compact and clean. The biggest issue I have with following this code is really with all the 'raw' registers. I would much rather prefer symbolic names, but up to you to decide style. > > I ended up 'annotating' this snippet, so I could understand it and confirm everything.. as with montmulEven, hope some of it can be useful to you to copy out. 
> > > XMMRegister POLY1[] = {xmm0, xmm1, xmm2, xmm3}; > XMMRegister POLY2[] = {xmm4, xmm5, xmm6, xmm7}; > XMMRegister SCRATCH1[] = {xmm12, xmm13, xmm14, xmm15}; > XMMRegister SCRATCH2[] = {xmm16, xmm17, xmm18, xmm19}; > XMMRegister SCRATCH3[] = {xmm8, xmm9, xmm10, xmm11}; > for (int i = 0; i < 4; i++) { > __ evmovdqul(POLY1[i], Address(poly1, i * 64), Assembler::AVX_512bit); > __ evmovdqul(POLY2[i], Address(poly2, i * 64), Assembler::AVX_512bit); > } > > // montmulEven: inputs are in even columns and output is in odd columns > // scratch3_even = poly2_even*montRSquareModQ // poly2 to montgomery domain > montmulEven2(SCRATCH3[0], POLY2[0], montRSquareModQ, SCRATCH1[0], SCRATCH2[0], montQInvModR, dilithium_q, 4, _masm); > for (int i = 0; i < 4; i++) { > // swap even/odd; 0xB1 == 2-3-0-1 > __ vpshufd(SCRATCH3[i], SCRATCH3[i], 0xB1, Assembler::AVX_512bit); > } > > // scratch3_odd = poly1_even*scratch3_even = poly1_even*poly2_even*montRSquareModQ > montmulEven2(SCRATCH3[0], POLY1[0], SCRATCH3[0], SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); > for (int i = 0; i < 4; i++) { > __ vpshufd(POLY1[i], POLY1[i], 0xB1, Assembler::AVX_512bit); > __ vpshufd(POLY2[i], POLY2[i], 0xB1, Assembler::AVX_512bit); > } > > // poly2_even = poly2_odd*montRSquareModQ // poly2 to montgomery domain > montmulEven2(POLY2[0], POLY2[0], montRSquareModQ, SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); > for (int i = 0; i < 4; i++) { > __ vpshufd(POLY2[i], POLY2[i], 0xB1, Assembler::AVX_512bit); > } > > // poly1_odd = poly1_even*poly2_even > montmulEven2(POLY1[0], POLY1[0], POLY2[0], SCRATCH1[0], SCRATCH2[0], 4, montQInvModR, dilithium_q, 4, _masm); > for (int i = 0; i < 4; i++) { > // result is scrambled between scratch3_odd and poly1_odd; unscramble > __ evpermt2d(POLY1[i], perms, SCRATCH3[i], Assembler::AVX_512bit); > } > for (int i = 0; i < 4; i++) { > __ evmovdqul(Address(result, i *... I have rewritten it to use full montmuls (a new function) her and everywhere else. It is much easier to follow the code that way. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1090: > >> 1088: __ evpbroadcastd(xmm29, constant, Assembler::AVX_512bit); // constant multiplier >> 1089: >> 1090: __ movl(len, 2); > > Same comment here as the `generate_dilithiumNttMult_avx512` > - constants can be loaded directly into XMM > - len can be removed by unrolling at compile time > - symbolic names could be used for registers > - comments could be added Done. 
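For readers following the montmul discussion without the stub sources open, here is a minimal scalar Java sketch of the Montgomery multiplication that the vectorized helpers compute one lane at a time. The constant names mirror the `// Java:` annotations quoted above; the concrete values (q = 8380417, R = 2^32, q^-1 mod 2^32 = 58728449) are the standard ML-DSA parameters and are assumed here rather than copied from ML_DSA.java.

```
// Scalar sketch of the Montgomery multiplication behind the montmul helpers.
// Constant values are the standard ML-DSA parameters (assumed, not taken from
// ML_DSA.java); names follow the "// Java:" comments quoted in this review.
public final class MontMulSketch {
    static final int MONT_Q = 8380417;             // q = 2^23 - 2^13 + 1
    static final int MONT_R_BITS = 32;             // R = 2^32
    static final int MONT_Q_INV_MOD_R = 58728449;  // q^-1 mod 2^32

    // Returns a value congruent to a*b*R^-1 mod q; for |a*b| < 2^31*q it lies in (-q, q).
    static int montMul(int a, int b) {
        long ab = (long) a * (long) b;             // full 64-bit product
        int m = (int) ab * MONT_Q_INV_MOD_R;       // low 32 bits of ab * q^-1
        // m*q == ab (mod 2^32), so the low 32 bits cancel and the shift is exact.
        return (int) ((ab - (long) m * MONT_Q) >> MONT_R_BITS);
    }

    public static void main(String[] args) {
        // R^2 mod q is the "2^64 mod q" / montRSquareModQ constant seen in the stub;
        // multiplying by it moves a value into the Montgomery domain.
        long rModQ = (1L << MONT_R_BITS) % MONT_Q;
        int rSquareModQ = (int) ((rModQ * rModQ) % MONT_Q);
        int a = 1234567;
        int aMont = montMul(a, rSquareModQ);       // a*R mod q (up to sign)
        int aBack = montMul(aMont, 1);             // back out of the domain
        System.out.println(Math.floorMod(aBack, MONT_Q)); // prints 1234567
    }
}
```

The even/odd column juggling in the stub (montmulEven plus the vpshufd swaps in the snippet above) exists only because vpmuldq consumes the even 32-bit column of each 64-bit lane; per coefficient the arithmetic is essentially the scalar routine shown here.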
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455445 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455814 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455732 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006454991 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455529 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455662 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455178 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2006455086 From never at openjdk.org Fri Mar 21 05:59:09 2025 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 21 Mar 2025 05:59:09 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. Seems like a good change. ------------- Marked as reviewed by never (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23849#pullrequestreview-2704807131 From thartmann at openjdk.org Fri Mar 21 12:59:09 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 21 Mar 2025 12:59:09 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. Nice cleanup, CI changes look good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23849#pullrequestreview-2705854715 From dnsimon at openjdk.org Fri Mar 21 13:03:17 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 21 Mar 2025 13:03:17 GMT Subject: RFR: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. Thanks for the reviews! 
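As a side note for anyone who wants to observe the new alignment from user code: the hypothetical check below compares the JVMCI field order against core reflection for a single class. It is not part of this PR or its tests; it assumes a JVM started with JVMCI enabled and the jdk.internal.vm.ci packages exported to the application (launcher flags omitted here), and it filters out statics because Class.getDeclaredFields() includes them while getInstanceFields() does not.

```
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import jdk.vm.ci.meta.MetaAccessProvider;
import jdk.vm.ci.meta.ResolvedJavaField;
import jdk.vm.ci.runtime.JVMCI;

// Hypothetical sanity check, not part of the change: compares the order of
// instance fields reported by JVMCI with the order from core reflection.
public class FieldOrderCheck {
    static class Sample { int a; static int ignored; long b; Object c; }

    public static void main(String[] args) {
        MetaAccessProvider meta =
                JVMCI.getRuntime().getHostJVMCIBackend().getMetaAccess();
        ResolvedJavaField[] jvmciFields =
                meta.lookupJavaType(Sample.class).getInstanceFields(false);

        String[] reflectionOrder = Arrays.stream(Sample.class.getDeclaredFields())
                .filter(f -> !Modifier.isStatic(f.getModifiers()))   // drop statics
                .map(Field::getName)
                .toArray(String[]::new);
        String[] jvmciOrder = Arrays.stream(jvmciFields)
                .map(ResolvedJavaField::getName)
                .toArray(String[]::new);

        // Expected to print true once the two orders are aligned by this change.
        System.out.println(Arrays.equals(reflectionOrder, jvmciOrder));
    }
}
```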
------------- PR Comment: https://git.openjdk.org/jdk/pull/23849#issuecomment-2743295318 From dnsimon at openjdk.org Fri Mar 21 13:03:17 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 21 Mar 2025 13:03:17 GMT Subject: Integrated: 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 23:46:54 GMT, Doug Simon wrote: > The current order of fields returned by `ResolvedJavaType.getInstanceFields` is a) not well specified and b) different than the order of fields used almost everywhere else in HotSpot. This PR aligns the order of `getInstanceFields` with `Class.getDeclaredFields()`. > > It also makes `ciInstanceKlass::_nonstatic_fields` use the same order which unifies how escape analysis and deoptimization treats fields across C2 and JVMCI. This pull request has now been integrated. Changeset: 0cb110eb Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/0cb110ebb7f8d184dd855f64c5dd7924c8202b3d Stats: 89 lines in 6 files changed: 18 ins; 32 del; 39 mod 8350892: [JVMCI] Align ResolvedJavaType.getInstanceFields with Class.getDeclaredFields Reviewed-by: yzheng, never, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/23849 From adinn at openjdk.org Fri Mar 21 14:02:17 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Fri, 21 Mar 2025 14:02:17 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v4] In-Reply-To: References: Message-ID: On Tue, 4 Mar 2025 22:04:26 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Fixed mismerge. > - Merged master. > - A little cleanup > - Merged master > - removing trailing spaces > - kyber aarch64 intrinsics src/hotspot/share/opto/library_call.cpp line 7800: > 7798: const char *stubName; > 7799: assert(UseKyberIntrinsics, "need Kyber intrinsics support"); > 7800: assert(callee()->signature()->size() == 3, "kyber12To16 has 3 parameters"); Just as an aside this causes testing of a debug build to fail. The intrinsic has 4 parameters. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2007638886 From adinn at openjdk.org Fri Mar 21 14:02:18 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Fri, 21 Mar 2025 14:02:18 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v4] In-Reply-To: References: Message-ID: <54ED2n9rhYXWQuwge7bPuvPXtAmL2WpfRJFfXH__r2I=.dead1c37-4283-48a6-ad01-26fc92be30fa@github.com> On Fri, 21 Mar 2025 13:59:10 GMT, Andrew Dinn wrote: >> Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: >> >> - Fixed mismerge. >> - Merged master. >> - A little cleanup >> - Merged master >> - removing trailing spaces >> - kyber aarch64 intrinsics > > src/hotspot/share/opto/library_call.cpp line 7800: > >> 7798: const char *stubName; >> 7799: assert(UseKyberIntrinsics, "need Kyber intrinsics support"); >> 7800: assert(callee()->signature()->size() == 3, "kyber12To16 has 3 parameters"); > > Just as an aside this causes testing of a debug build to fail. The intrinsic has 4 parameters. With this value reset to 4 the ML_DSA test passes for ML_KEM on a debug build. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2007642721 From tschatzl at openjdk.org Fri Mar 21 14:20:34 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 21 Mar 2025 14:20:34 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v27] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 35 commits: - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq - * make young gen length revising independent of refinement thread * use a service task * both refinement control thread and young gen length revising use the same infrastructure to get the number of available bytes and determine the time to the next update - * fix IR code generation tests that change due to barrier cost changes - * factor out card table and refinement table merging into a single method - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 - * obsolete G1UpdateBufferSize G1UpdateBufferSize has previously been used to size the refinement buffers and impose a minimum limit on the number of cards per thread that need to be pending before refinement starts. The former function is now obsolete with the removal of the dirty card queues, the latter functionality has been taken over by the new diagnostic option `G1PerThreadPendingCardThreshold`. I prefer to make this a diagnostic option is better than a product option because it is something that is only necessary for some test cases to produce some otherwise unwanted behavior (continuous refinement). CSR is pending. - * more documentation on why we need to rendezvous the gc threads - Merge branch 'master' into 8342381-card-table-instead-of-dcq - * ayang review * re-add STS leaver for java thread handshake - * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations. * additional verification * added some missing ResourceMarks in asserts * added variant of ArrayJuggle2 that crashes fairly quickly without these changes - ... and 25 more: https://git.openjdk.org/jdk/compare/0cb110eb...d9311047 ------------- Changes: https://git.openjdk.org/jdk/pull/23739/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=26 Stats: 7089 lines in 110 files changed: 2610 ins; 3555 del; 924 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From vlivanov at openjdk.org Fri Mar 21 22:37:14 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 21 Mar 2025 22:37:14 GMT Subject: RFR: 8346989: Deoptimization and re-compilation cycle with C2 compiled code In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> <_3l8ylsbgvsqQE1Ihp0BUAx2o_VzcS6R2jWBSKW9u1E=.0dcb6086-ff6f-4c9a-b990-6665a476a3dc@github.com> Message-ID: On Thu, 20 Mar 2025 12:26:52 GMT, Tobias Hartmann wrote: >> src/hotspot/share/opto/library_call.cpp line 1963: >> >>> 1961: set_i_o(i_o()); >>> 1962: >>> 1963: uncommon_trap(Deoptimization::Reason_intrinsic, >> >> What about using `builtin_throw` here? (Requires some tuning on `builtin_throw` side.) How much does it affect performance? Also, passing `must_throw = true` into `uncommon_trap` may help a bit here as well. > > I think adapting and re-using `builtin_throw` like you described is reasonable but I let @iwanowww confirm :slightly_smiling_face: Yes, that's basically what I had in mind. Currently, the focus of the intrinsic is on well-behaved case (overflows are **very** rare). `builtin_throw()` covers more ground and optimize for scenarios when exceptions are thrown. 
But it depends on `ciMethod::can_omit_stack_trace()` where `-XX:-OmitStackTraceInFastThrow` mode will suffer from the original problem (continuous deoptimizations), plus a round of recompilations before giving up. I suggest to improve and reuse `builtin_throw` here and add additional checks in the intrinsic to guard against problematic scenario with continuous deoptimizations. IMO it improves performance model for a wide range of use cases while addressing pathological scenarios. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2008427776 From duke at openjdk.org Sat Mar 22 12:23:47 2025 From: duke at openjdk.org (Zihao Lin) Date: Sat, 22 Mar 2025 12:23:47 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order Message-ID: 8347706: jvmciEnv.cpp has jvmci includes out of order ------------- Commit messages: - 8347706: Reorder jvmci includes in jvmciEvn.cpp Changes: https://git.openjdk.org/jdk/pull/24174/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24174&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8347706 Stats: 6 lines in 1 file changed: 3 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24174.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24174/head:pull/24174 PR: https://git.openjdk.org/jdk/pull/24174 From dnsimon at openjdk.org Sat Mar 22 14:44:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Sat, 22 Mar 2025 14:44:09 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: <4eXcUGVycNCCf3Ago-Mtf7zobSoLrZVEateUS0NpQuQ=.3512d121-7ad0-422f-9dcd-3edf4e28ec4e@github.com> On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp The change is fine but I personally think manually fixing these ordering problems is not the best use of time until there's a way to automatically enforce the expected ordering (and catch regressions). ------------- Marked as reviewed by dnsimon (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24174#pullrequestreview-2708058400 From duke at openjdk.org Sat Mar 22 14:54:06 2025 From: duke at openjdk.org (Zihao Lin) Date: Sat, 22 Mar 2025 14:54:06 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp You are right, Do we have some code check tool which help to point out the ordering issue? ------------- PR Comment: https://git.openjdk.org/jdk/pull/24174#issuecomment-2745306642 From duke at openjdk.org Sat Mar 22 14:54:06 2025 From: duke at openjdk.org (duke) Date: Sat, 22 Mar 2025 14:54:06 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp @linzihao1999 Your change (at version f0a6b84815d7a866a1561426d6b31a5e3f3b3c73) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24174#issuecomment-2745306821 From duke at openjdk.org Sat Mar 22 20:02:31 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Sat, 22 Mar 2025 20:02:31 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: - Further readability improvements. - Added asserts for array sizes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/e9db09e2..56656894 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=09-10 Stats: 228 lines in 2 files changed: 72 ins; 56 del; 100 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From vpaprotski at openjdk.org Sat Mar 22 20:05:11 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Sat, 22 Mar 2025 20:05:11 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v10] In-Reply-To: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> References: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> Message-ID: <2yP2P1VNWgQu6cWvn0_a_7LdidS71C6PWKcqGKTOHnc=.49f8ac0f-df23-4f1e-adb9-e03a3f2295b2@github.com> On Thu, 20 Mar 2025 20:37:25 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Fix windows build was going to finish the rest of the functions.. but I see you pushed an update so I better rebase! here are the pending comments I had that perhaps are no longer applicable.. (working through the ntt math..) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 121: > 119: static void montmulEven(int outputReg, int inputReg1, int inputReg2, > 120: int scratchReg1, int scratchReg2, > 121: int parCnt, MacroAssembler *_masm) { nitpick.. this could be made to look more like `montMul64()` by also taking in an array of registers. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 160: > 158: for (int i = 0; i < 4; i++) { > 159: __ vpmuldq(xmm(scratchRegs[i]), xmm(inputRegs1[i]), xmm(inputRegs2[i]), > 160: Assembler::AVX_512bit); using an array of registers, instead of array of ints would read somewhat more compact and fewer 'indirections' . i.e. static void montMul64(XMMRegister outputRegs*, XMMRegister inputRegs1*, XMMRegister inputRegs2*, ... __ vpmuldq(scratchRegs[i], inputRegs1[i], inputRegs2[i], Assembler::AVX_512bit); src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 216: > 214: // Zmm8-Zmm23 used as scratch registers > 215: // result goes to Zmm0-Zmm7 > 216: static void montMulByConst128(MacroAssembler *_masm) { wish the inputs and output register arrays were explicit.. easier to follow that way src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 230: > 228: } > 229: > 230: static void sub_add(int subResult[], int addResult[], Big fan of all these helper functions! Makes reading the top level functions way easier, thanks for refactoring! src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 279: > 277: static int xmm4_20_24[] = {4, 5, 6, 7, 20, 21, 22, 23, 24, 25, 26, 27}; > 278: static int xmm16_27[] = {16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27}; > 279: static int xmm29_29[] = {29, 29, 29, 29}; I very much like the new refactor, waaaay clearer now. 
Some 'Could Do' comments.. - I probably would have preferred 'even more symbolic' variable names (i.e. its ideal when you can match the java variable names!). Conversely, if 'forced to defend this style', these names are MUCH much easier to debug from GDB, its clear what the matching instruction is. - Not sure about it being global. It works currently, but less 'future proof'. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 645: > 643: // poly1 (int[256]) = c_rarg1 > 644: // poly2 (int[256]) = c_rarg2 > 645: static address generate_dilithiumNttMult_avx512(StubGenerator *stubgen, This would be 'nice to have', something 'lost' with the refactor.. As I was reviewing this (original) function, I was thinking, "there is nothing here _that_ specific to AVX512, mostly columnar&independent operations... This function could be made 'vector-length-independent'..." - double the loop length: int iter = vector_len==Assembler::AVX_512bit?4:8; __ movl(len, 4); -> __ movl(len, iter); - halve the register arrays.. (or keep them the same but shuffle them to make SURE the first half are in xmm0-xmm15 range) XMMRegister POLY1[] = {xmm0, xmm1, xmm12, xmm13}; XMMRegister POLY2[] = {xmm4, xmm5, xmm16, xmm17}; XMMRegister SCRATCH1[] = {xmm2, xmm3, xmm14, xmm15}; <<< here XMMRegister SCRATCH2[] = {xmm6, xmm7, xmm18, xmm19}; <<< and here XMMRegister SCRATCH3[] = {xmm8, xmm9, xmm10, xmm11}; - couple of other int constants (like the memory 'step' and such) - for assembler calls, like `evmovdqul` and `evpsubd`, need a few small new MacroAssembler helpers to instead generate VEX encoded versions (plenty of instructions already do that). - I think only the perm instruction was unique to evex (didnt really think of an alternative for AVX2.. but can be abstracted away with another helper) Anyway; not suggesting its something you do here.. but it would be convenient to leave breadcrumbs/hooks for a future update so one of us can revisit this code and add AVX2 support. e.g. `parCnt` variable was very convenient before for exactly this, now its gone... it probably could be derived in each function from vector_len but..; Its now cleaner, but also harder to 'upgrade'? Why AVX2? many of the newer (Atom/Ecore-based/EnableX86ECoreOpts) processors do not have AVX512 support, so its something I've been prioritizing recently The alternative would be to write a completely separate AVX2 implementation, but that would be a shame, not to 'just' reuse this code. ? "For fun", I had even gone and parametrized the mult function with the `vector_len` to see how it would look (almost identical... 
to the original version): static void montmulEven2(XMMRegister* outputReg, XMMRegister* inputReg1, XMMRegister* inputReg2, XMMRegister* scratchReg1, XMMRegister* scratchReg2, XMMRegister montQInvModR, XMMRegister dilithium_q, int parCnt, int vector_len, MacroAssembler* _masm) { for (int i = 0; i < parCnt; i++) { // scratch1 = (int64)input1_even*input2_even // Java: long a = (long) b * (long) c; __ vpmuldq(scratchReg1[i], inputReg1[i], inputReg2[i], vector_len); } for (int i = 0; i < parCnt; i++) { // scratch2 = int32(montQInvModR*(int32)scratch1) // Java: int aLow = (int) a; // Java: int m = MONT_Q_INV_MOD_R * aLow; // signed low product __ vpmulld(scratchReg2[i], scratchReg1[i], montQInvModR, vector_len); } for (int i = 0; i < parCnt; i++) { // scratch2 = (int64)scratch2_even*dilithium_q_even // Java: ((long)m * MONT_Q) __ vpmuldq(scratchReg2[i], scratchReg2[i], dilithium_q, vector_len); } for (int i = 0; i < parCnt; i++) { // output_odd = scratch1_odd - scratch2_odd // Java: (aHigh - (int) (("scratch2") >> MONT_R_BITS)) __ vpsubd(outputReg[i], scratchReg1[i], scratchReg2[i], vector_len); } } ------------- PR Review: https://git.openjdk.org/jdk/pull/23860#pullrequestreview-2708079853 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008809855 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008811046 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008811541 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008811704 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008808110 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008824304 From vpaprotski at openjdk.org Sat Mar 22 20:05:12 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Sat, 22 Mar 2025 20:05:12 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Thu, 20 Mar 2025 21:06:30 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 58: >> >>> 56: >>> 57: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Perms[] = { >>> 58: // collect montmul results into the destination register >> >> same as `dilithiumAvx512Consts()`, 'magic offsets'; except here they are harder to count (eg. not clear visually what is the offset of `ntt inverse`). >> >> Could be split into three constant arrays to make the compiler count for us > > Well, it is 64 bytes per line (16 4-byte uint32_ts), not that hard :-) ... Ha! I didn't realize it was 16 per line.. ran out of fingers while counting!!! :) 'works for me, as long as its a "premeditated" decision' >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 980: >> >>> 978: // Dilithium multiply polynomials in the NTT domain. >>> 979: // Implements >>> 980: // static int implDilithiumNttMult( >> >> I suppose no java changes in this PR, but I notice that the inputs are all assumed to have fixed size. >> >> Most/all intrinsics I worked with had some sort of guard (eg `Objects.checkFromIndexSize`) right before the intrinsic java call. (It usually looks like it can be optimized away). But I notice no such guard here on the java side. > > These functions will not be used anywhere else and in ML_DSA.java all of the arrays passed to inrinsics are of the correct size. Works for me; just thought I would point it out, so its a 'premeditated' decision. 
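Since Java-side guards come up a couple of times in this review, the sketch below shows the kind of `Objects.checkFromIndexSize` check being referred to. It is purely illustrative: the method and constant names are made up, and, as stated above, ML_DSA.java instead relies on its callers always passing arrays of the expected size.

```
import java.util.Objects;

// Illustrative only: the usual shape of a bounds guard placed in front of an
// intrinsified helper. Names here are hypothetical, not from ML_DSA.java.
final class GuardSketch {
    private static final int COEFF_COUNT = 256;

    static void nttMult(int[] product, int[] coeffs1, int[] coeffs2) {
        // Throws IndexOutOfBoundsException if an array is shorter than expected;
        // the reviewer notes such checks usually look like they can be optimized away.
        Objects.checkFromIndexSize(0, COEFF_COUNT, product.length);
        Objects.checkFromIndexSize(0, COEFF_COUNT, coeffs1.length);
        Objects.checkFromIndexSize(0, COEFF_COUNT, coeffs2.length);
        implNttMult(product, coeffs1, coeffs2);   // the intrinsic candidate (hypothetical)
    }

    private static void implNttMult(int[] product, int[] coeffs1, int[] coeffs2) {
        // pure-Java fallback would live here
    }
}
```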
>> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1012: >> >>> 1010: __ evmovdqul(xmm28, Address(perms, 0), Assembler::AVX_512bit); >>> 1011: >>> 1012: __ movl(len, 4); >> >> Compile-time constant, why not 'unroll at compile time'? i.e. wrap this loop with `for (int len=0; len<4; len++)` instead? > > I have found that unrolling these loops actually hurts performance (probably an I-cache effect. Interesting; I keep on having to re-train my intuition, thanks for the data ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008806159 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008805574 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008805113 From duke at openjdk.org Sat Mar 22 20:23:25 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Sat, 22 Mar 2025 20:23:25 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v5] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Fixed bad assertion. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23663/files - new: https://git.openjdk.org/jdk/pull/23663/files/7e9b3d84..9ec9a6cd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From vpaprotski at openjdk.org Sat Mar 22 20:42:09 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Sat, 22 Mar 2025 20:42:09 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 20:02:31 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Further readability improvements. > - Added asserts for array sizes src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 119: > 117: static address dilithiumAvx512PermsAddr() { > 118: return (address) dilithiumAvx512Perms; > 119: } Hear me out.. ... enums!! enum nttPermOffset { montMulPermsIdx = 0, nttL4PermsIdx = 64, nttL5PermsIdx = 192, nttL6PermsIdx = 320, nttL7PermsIdx = 448, nttInvL0PermsIdx = 704, nttInvL1PermsIdx = 832, nttInvL2PermsIdx = 960, nttInvL3PermsIdx = 1088, nttInvL4PermsIdx = 1216, }; static address dilithiumAvx512PermsAddr(nttPermOffset offset) { return (address) dilithiumAvx512Perms + offset; } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2008900858 From duke at openjdk.org Sun Mar 23 00:39:11 2025 From: duke at openjdk.org (Zihao Lin) Date: Sun, 23 Mar 2025 00:39:11 GMT Subject: Integrated: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp This pull request has now been integrated. 
Changeset: df9210e6 Author: Zihao Lin Committer: SendaoYan URL: https://git.openjdk.org/jdk/commit/df9210e6578acd53384ee1ac06601510c9a52696 Stats: 6 lines in 1 file changed: 3 ins; 3 del; 0 mod 8347706: jvmciEnv.cpp has jvmci includes out of order Reviewed-by: dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/24174 From syan at openjdk.org Sun Mar 23 01:16:20 2025 From: syan at openjdk.org (SendaoYan) Date: Sun, 23 Mar 2025 01:16:20 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 12:16:31 GMT, Zihao Lin wrote: > Reorder jvmci includes in jvmciEvn.cpp > /sponsor Sorry, did not noticed that this PR no satisfied more than 24 hours... ------------- PR Comment: https://git.openjdk.org/jdk/pull/24174#issuecomment-2745952316 From dnsimon at openjdk.org Sun Mar 23 11:56:11 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Sun, 23 Mar 2025 11:56:11 GMT Subject: RFR: 8347706: jvmciEnv.cpp has jvmci includes out of order In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 14:51:16 GMT, Zihao Lin wrote: > You are right, Do we have some code check tool which help to point out the ordering issue? Not as far as I know but it should not be too hard to come up with. I've opened https://bugs.openjdk.org/browse/JDK-8352645 to have this considered. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24174#issuecomment-2746168751 From jbhateja at openjdk.org Mon Mar 24 02:41:14 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 24 Mar 2025 02:41:14 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 20:02:31 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Further readability improvements. > - Added asserts for array sizes src/hotspot/cpu/x86/vm_version_x86.cpp line 1252: > 1250: // Currently we only have them for AVX512 > 1251: #ifdef _LP64 > 1252: if (supports_evex() && supports_avx512bw()) { supports_evex check looks redundant. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2009379308 From vpaprotski at openjdk.org Mon Mar 24 15:19:22 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 24 Mar 2025 15:19:22 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: <_TOBoO4cMQpw4sgzIpNpQZ2w5wDgezKQZLe314DQ7zo=.813b81bf-ecc0-4f75-a0d6-fbb13dde594e@github.com> On Sat, 22 Mar 2025 20:02:31 GMT, Ferenc Rakoczi wrote: >> By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Further readability improvements. > - Added asserts for array sizes I still need to have a look at the sha3 changes, but I think I am done with the most complex part of the review. This was a really interesting bit of code to review! src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 270: > 268: } > 269: > 270: static void loadPerm(int destinationRegs[], Register perms, `replXmm`? i.e. 
this function is replicating (any) Xmm register, not just perm?.. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 327: > 325: // > 326: // > 327: static address generate_dilithiumAlmostNtt_avx512(StubGenerator *stubgen, Similar comments as to `generate_dilithiumAlmostInverseNtt_avx512` - similar comment about the 'pair-wise' operation, updating `[j]` and `[j+l]` at a time.. - somehow had less trouble following the flow through registers here, perhaps I am getting used to it. FYI, ended renaming some as: // xmm16_27 = Temp1 // xmm0_3 = Coeffs1 // xmm4_7 = Coeffs2 // xmm8_11 = Coeffs3 // xmm12_15 = Coeffs4 = Temp2 // xmm16_27 = Scratch src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 421: > 419: for (int i = 0; i < 8; i += 2) { > 420: __ evpermi2d(xmm(i / 2 + 12), xmm(i), xmm(i + 1), Assembler::AVX_512bit); > 421: } Wish there was a more 'abstract' way to arrange this, so its obvious from the shape of the code what registers are input/outputs (i.e. and use the register arrays). Even though its just 'elementary index operations' `i/2 + 16` is still 'clever'. Couldnt think of anything myself though (same elsewhere in this function for the table permutes). src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 509: > 507: // coeffs (int[256]) = c_rarg0 > 508: // zetas (int[256]) = c_rarg1 > 509: static address generate_dilithiumAlmostInverseNtt_avx512(StubGenerator *stubgen, Done with this function; Perhaps the 'permute table' is a common vector-algorithm pattern, but this is really clever! Some general comments first, rest inline. - The array names for registers helped a lot. And so did the new helper functions! - The java version of this code is quite intimidating to vectorize.. 3D loop, with geometric iteration variables.. and the literature is even more intimidating (discrete convolutions which I havent touched in two decades, ffts, ntts, etc.) Here is my attempt at a comment to 'un-scare' the next reader, though feel free to reword however you like. The core of the (Java) loop is this 'pair-wise' operation: int a = coeffs[j]; int b = coeffs[j + offset]; coeffs[j] = (a + b); coeffs[j + offset] = montMul(a - b, -MONT_ZETAS_FOR_NTT[m]); There are 8 'levels' (0-7); ('levels' are equivalent to (unrolling) the outer (Java) loop) At each level, the 'pair-wise-offset' doubles (2^l: 1, 2, 4, 8, 16, 32, 64, 128). To vectorize this Java code, observe that at each level, REGARDLESS the offset, half the operations are the SUM, and the other half is the montgomery MULTIPLICATION (of the pair-difference with a constant). At each level, one 'just' has to shuffle the coefficients, so that SUMs and MULTIPLICATIONs line up accordingly. Otherwise, this pattern is 'lightly similar' to a discrete convolution (compute integral/summation of two functions at every offset) - I still would prefer (more) symbolic register names.. I wouldn't hold my approval over it so won't object if nobody else does, but register numbers are harder to 'see' through the flow. I ended up search/replacing/'annotating' to make it easier on myself to follow the flow of data: // xmm8_11 = Perms1 // xmm12_15 = Perms2 // xmm16_27 = Scratch // xmm0_3 = CoeffsPlus // xmm4_7 = CoeffsMul // xmm24_27 = CoeffsMinus (overlaps with Scratch) (I made a similar comment, but I think it is now hidden after the last refactor) - would prefer to see the helper functions to get ALL the registers passed explicitly (i.e. currently `montMulPerm`, `montQInvModR`, `dilithium_q`, `xmm29`, are implicit.). 
As a general rule, I've tried to set up all the registers up at the 'entry' function (`generate_dilithium*` in this case) and from there on, use symbolic names. Not always reasonable, but what I've grown used to see? src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 554: > 552: for (int i = 0; i < 8; i += 2) { > 553: __ evpermi2d(xmm(i / 2 + 8), xmm(i), xmm(i + 1), Assembler::AVX_512bit); > 554: __ evpermi2d(xmm(i / 2 + 12), xmm(i), xmm(i + 1), Assembler::AVX_512bit); Took a bit to unscramble the flow, so a comment needed? Purpose 'fairly obvious' once I got the general shape of the level/algorithm (as per my top-level comment) but something like "shuffle xmm0-7 into xmm8-15"?
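To make the pair-wise description above a bit more concrete, a hypothetical scalar Java sketch of the pattern follows: one Montgomery multiplication (mirroring the steps annotated in montmulEven2 earlier in the thread) plus the 8-level pair-wise loop. This is not the actual ML_DSA.java source; the class name, the constants and the zeta-table indexing are assumptions, and the montMul result is only partially reduced.

import java.math.BigInteger;

public final class NttLevelSketch {
    static final int N = 256;
    static final int MONT_Q = 8380417;   // assumed Dilithium modulus q
    static final int MONT_R_BITS = 32;   // R = 2^32
    static final int MONT_Q_INV_MOD_R =  // q^-1 mod 2^32, computed rather than hard-coded
            BigInteger.valueOf(MONT_Q).modInverse(BigInteger.ONE.shiftLeft(MONT_R_BITS)).intValue();

    // Returns a value congruent to b * c * R^-1 (mod q), only partially reduced.
    static int montMul(int b, int c) {
        long a = (long) b * (long) c;        // full 64-bit product
        int m = MONT_Q_INV_MOD_R * (int) a;  // low-half multiply, wraps mod 2^32
        // a and (long) m * MONT_Q agree in their low 32 bits, so their difference is an
        // exact multiple of 2^32; only the high halves survive the subtraction.
        return (int) (a >> MONT_R_BITS) - (int) (((long) m * MONT_Q) >> MONT_R_BITS);
    }

    // The 8 levels of pair-wise operations: at every level half the work is a SUM and
    // the other half a Montgomery MULTIPLICATION of the pair difference by a constant.
    static void pairwiseLevels(int[] coeffs, int[] zetas) {
        int m = 0;
        for (int level = 0; level < 8; level++) {
            int offset = 1 << level;                    // 1, 2, 4, ..., 128
            for (int start = 0; start < N; start += 2 * offset) {
                for (int j = start; j < start + offset; j++) {
                    int a = coeffs[j];
                    int b = coeffs[j + offset];
                    coeffs[j] = a + b;
                    coeffs[j + offset] = montMul(a - b, zetas[m]);
                }
                m++;                                    // zeta indexing here is an assumption
            }
        }
    }
}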
src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 572: > 570: load4Xmms(xmm4_7, zetas, 512, _masm); > 571: sub_add(xmm24_27, xmm0_3, xmm8_11, xmm12_15, _masm); > 572: montMul64(xmm4_7, xmm24_27, xmm4_7, xmm16_27, _masm); From my annotated version, levels 1-4, fairly 'straightforward': // level 1 replXmm(Perms1, perms, nttInvL1PermsIdx, _masm); replXmm(Perms2, perms, nttInvL1PermsIdx + 64, _masm); for (int i = 0; i < 4; i++) { __ evpermi2d(xmm(Perms1[i]), xmm(CoeffsPlus[i]), xmm(CoeffsMul[i]), Assembler::AVX_512bit); __ evpermi2d(xmm(Perms2[i]), xmm(CoeffsPlus[i]), xmm(CoeffsMul[i]), Assembler::AVX_512bit); } load4Xmms(CoeffsMul, zetas, 512, _masm); sub_add(CoeffsMinus, CoeffsPlus, Perms1, Perms2, _masm); montMul64(CoeffsMul, CoeffsMinus, CoeffsMul, Scratch, _masm); src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 613: > 611: montMul64(xmm4_7, xmm24_27, xmm4_7, xmm16_27, _masm); > 612: > 613: // level 5 "// No shuffling for level 5 and 6; can just rearrange full registers" src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 656: > 654: for (int i = 0; i < 8; i++) { > 655: __ evpsubd(xmm(i), k0, xmm(i + 8), xmm(i), false, Assembler::AVX_512bit); > 656: } Fairly clean as is, but could also be two sub_add calls, I think (you have to swap order of add/sub in the helper, to be able to clobber `xmm(i)`.. or swap register usage downstream, so perhaps not.. but would be cleaner) sub_add(CoeffsPlus, Scratch, Perms1, CoeffsPlus, _masm); sub_add(CoeffsMul, &Scratch[4], Perms2, CoeffsMul, _masm); If nothing else, would have preferred to see the use of the register array variables src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 660: > 658: store4Xmms(coeffs, 0, xmm16_19, _masm); > 659: store4Xmms(coeffs, 4 * XMMBYTES, xmm20_23, _masm); > 660: montMulByConst128(_masm); Would prefer explicit parameters here. But I think this could also be two `montMul64` calls? montMul64(xmm0_3, xmm0_3, xmm29_29, Scratch, _masm); montMul64(xmm4_7, xmm4_7, xmm29_29, Scratch, _masm); (I think there is one other use of `montMulByConst128` where the same applies; then you could delete both `montMulByConst128` and `montmulEven`) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 871: > 869: __ evpaddd(xmm5, k0, xmm1, barrettAddend, false, Assembler::AVX_512bit); > 870: __ evpaddd(xmm6, k0, xmm2, barrettAddend, false, Assembler::AVX_512bit); > 871: __ evpaddd(xmm7, k0, xmm3, barrettAddend, false, Assembler::AVX_512bit); Fairly 'straightforward' transcription of the java code.. no comments from me. At first glance using `xmm0_3`, `xmm4_7`, etc. might have been a good idea, but you only save one line per 4x group. (Unless you have one big loop, but I suspect that gives you worse performance? Is that something you tried already? Might be worth it otherwise..) src/java.base/share/classes/sun/security/provider/ML_DSA.java line 1418: > 1416: int twoGamma2, int multiplier) { > 1417: assert (input.length == ML_DSA_N) && (lowPart.length == ML_DSA_N) > 1418: && (highPart.length == ML_DSA_N); I wrote this test to verify java-to-intrinsic correspondence. Might be good to include it (and add the other 4 intrinsics).
This is very similar to all my other *Fuzz* tests I've been adding for my own intrinsics (and you made this test FAR easier to write by breaking out the java implementation; need to 'copy' that pattern myself) import java.util.Arrays; import java.util.Random; import java.lang.invoke.MethodHandle; import java.lang.invoke.MethodHandles; import java.lang.reflect.Field; import java.lang.reflect.Method; import java.lang.reflect.Constructor; public class ML_DSA_Intrinsic_Test { public static void main(String[] args) throws Exception { MethodHandles.Lookup lookup = MethodHandles.lookup(); Class kClazz = Class.forName("sun.security.provider.ML_DSA"); Constructor constructor = kClazz.getDeclaredConstructor( int.class); constructor.setAccessible(true); Method m = kClazz.getDeclaredMethod("mlDsaNttMultiply", int[].class, int[].class, int[].class); m.setAccessible(true); MethodHandle mult = lookup.unreflect(m); m = kClazz.getDeclaredMethod("implDilithiumNttMultJava", int[].class, int[].class, int[].class); m.setAccessible(true); MethodHandle multJava = lookup.unreflect(m); Random rnd = new Random(); long seed = rnd.nextLong(); rnd.setSeed(seed); //Note: it might be useful to increase this number during development of new intrinsics final int repeat = 1000000; int[] coeffs1 = new int[ML_DSA_N]; int[] coeffs2 = new int[ML_DSA_N]; int[] prod1 = new int[ML_DSA_N]; int[] prod2 = new int[ML_DSA_N]; try { for (int i = 0; i < repeat; i++) { run(prod1, prod2, coeffs1, coeffs2, mult, multJava, rnd, seed, i); } System.out.println("Fuzz Success"); } catch (Throwable e) { System.out.println("Fuzz Failed: " + e); } } private static final int ML_DSA_N = 256; public static void run(int[] prod1, int[] prod2, int[] coeffs1, int[] coeffs2, MethodHandle mult, MethodHandle multJava, Random rnd, long seed, int i) throws Exception, Throwable { for (int j = 0; j This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) at java.base/java.lang.Thread.run(Thread.java:1447) Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optimizer.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_FrameMap.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_RangeCheckElimination.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_InstructionPrinter.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/bcEscapeAnalyzer.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciInstance.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciEnv.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciUtilities.inline.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciMethod.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciUtilities.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciEnv.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciCallSite.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/bcEscapeAnalyzer.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciReplay.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci/ciInstanceKlass.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilationMemoryStatistic.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilationFailureInfo.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilationPolicy.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/directivesParser.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compileBroker.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/directivesParser.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilerDirectives.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/methodMatcher.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compilationMemoryStatistic.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/compileTask.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/disassembler.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler/oopMap.inline.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmciRuntime.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmci.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmciCompiler.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmci.hpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmciEnv.cpp /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci/jvmciJavaClasses.cpp Note that 
non-space characters after the closing " or > of an include statement can be used to prevent re-ordering of the include. For example: #include "e.hpp" #include "d.hpp" #include "c.hpp" // do not reorder #include "b.hpp" #include "a.hpp" will be reformatted as: #include "d.hpp" #include "e.hpp" #include "c.hpp" // do not reorder #include "a.hpp" #include "b.hpp" at SortIncludes.main(SortIncludes.java:190) at TestIncludesAreSorted.main(TestIncludesAreSorted.java:75) ... 4 more JavaTest Message: Test threw exception: java.lang.RuntimeException This PR includes a [commit](https://github.com/openjdk/jdk/pull/24247/commits/a76d4f98c7e6074b4745c1c1791fe605e352d79f) with ordering suppression comments for some files I discovered needed it while playing around in #24180 . This PR replaces #24180. ------------- Commit messages: - sort includes in subset of HotSpot sources and added a test to keep them sorted - added tool to sort includes - do not reorder certain includes Changes: https://git.openjdk.org/jdk/pull/24247/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8352645 Stats: 396 lines in 53 files changed: 335 ins; 54 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From shade at openjdk.org Wed Mar 26 11:35:25 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 26 Mar 2025 11:35:25 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support Message-ID: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. 
Additional testing: - [x] Linux x86_64 server fastdebug, `tier1` - [ ] Linux x86_64 server fastdebug, `all` ------------- Commit messages: - Leftover - Fix Changes: https://git.openjdk.org/jdk/pull/24250/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24250&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351155 Stats: 545 lines in 48 files changed: 0 ins; 511 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/24250.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24250/head:pull/24250 PR: https://git.openjdk.org/jdk/pull/24250 From stefank at openjdk.org Wed Mar 26 12:27:09 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Wed, 26 Mar 2025 12:27:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 09:21:59 GMT, Doug Simon wrote: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... Thanks for updating to use the lower-case comparison. I wonder if a small tweak can fix the extra blank lines I complained about in the other PR. The tool removes the extra blank line we have in our .inline.hpp. From the Style Guide: All .inline.hpp files should include their corresponding .hpp file as the first include line with a blank line separating it from the rest of the include lines. Declarations needed by other files should be put in the .hpp file, and not in the .inline.hpp file. This rule exists to resolve problems with circular dependencies between .inline.hpp files. I think this needs to be fixed, otherwise people will start to remove these. src/hotspot/share/compiler/oopMap.inline.hpp line 29: > 27: > 28: #include "compiler/oopMap.hpp" > 29: This blank line should not be removed. test/hotspot/jtreg/sources/SortIncludes.java line 77: > 75: blankLines = List.of(""); > 76: } > 77: result.addAll(blankLines); If this line is removed you don't get the extra blank lines I mentioned in the previous PR. It also removes the extra blank line that you get inserted into oopMap.inline.hpp before the INCLUDE_JVMCI block. ------------- Changes requested by stefank (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2716954567 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2014026694 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2014025793 From stefank at openjdk.org Wed Mar 26 13:43:16 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Wed, 26 Mar 2025 13:43:16 GMT Subject: RFR: 8352645: Add tool support to check order of includes In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 12:19:14 GMT, Stefan Karlsson wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. 
`"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > test/hotspot/jtreg/sources/SortIncludes.java line 77: > >> 75: blankLines = List.of(""); >> 76: } >> 77: result.addAll(blankLines); > > If this line is removed you don't get the extra blank lines I mentioned in the previous PR. It also removes the extra blank line that you get inserted into oopMap.inline.hpp before the INCLUDE_JVMCI block. Or, rather if the code is changed to: if (!userIncludes.isEmpty() && !sysIncludes.isEmpty()) { result.add(""); } result.addAll(sysIncludes); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2014172537 From dnsimon at openjdk.org Wed Mar 26 14:23:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 26 Mar 2025 14:23:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. 
`"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... Doug Simon has updated the pull request incrementally with one additional commit since the last revision: drop extra blank lines and preserve rule for first include in .inline.hpp files ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/62779478..18e2a1d6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=00-01 Stats: 52 lines in 4 files changed: 40 ins; 5 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From duke at openjdk.org Wed Mar 26 15:23:47 2025 From: duke at openjdk.org (Zihao Lin) Date: Wed, 26 Mar 2025 15:23:47 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make Message-ID: This patch remove slice parameter from LoadNode::make Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 Hi team, I am new, I'd appreciate any guidance. Thank a lot! 
------------- Commit messages: - 8344116: C2: remove slice parameter from LoadNode::make Changes: https://git.openjdk.org/jdk/pull/24258/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8344116 Stats: 54 lines in 13 files changed: 3 ins; 14 del; 37 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From duke at openjdk.org Wed Mar 26 15:43:30 2025 From: duke at openjdk.org (Zihao Lin) Date: Wed, 26 Mar 2025 15:43:30 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make [v2] In-Reply-To: References: Message-ID: <6NXNfV1dqzZxpogva4dsv0kxkAQtJlgmLnSHvgZm5YA=.461d9a09-1e23-4acd-8230-0840348183ef@github.com> > This patch remove slice parameter from LoadNode::make > > Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 > > Hi team, I am new, I'd appreciate any guidance. Thank a lot! Zihao Lin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'openjdk:master' into 8344116 - 8344116: C2: remove slice parameter from LoadNode::make ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24258/files - new: https://git.openjdk.org/jdk/pull/24258/files/27df4a01..f4ef46dc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=00-01 Stats: 34071 lines in 1200 files changed: 1990 ins; 30272 del; 1809 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From stefank at openjdk.org Wed Mar 26 15:46:18 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Wed, 26 Mar 2025 15:46:18 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. 
>> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files Thanks for doing the last two fixes. I think this looks good now, but I need a bit more time to do some deeper verification. Thanks! ------------- PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2717748962 From kvn at openjdk.org Wed Mar 26 19:02:13 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Mar 2025 19:02:13 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support In-Reply-To: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Wed, 26 Mar 2025 10:11:25 GMT, Aleksey Shipilev wrote: > C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. > > For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. > > For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` Good. ------------- Marked as reviewed by kvn (Reviewer). 
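As an aside for readers following the blank-line discussion in this thread, a simplified, hypothetical sketch of the separator logic Stefan suggests is shown below. The class and method names are invented and the real SortIncludes.java differs in its details; the sketch only illustrates emitting a single blank line between the user ("...") and sys (<...>) include groups when both are present.

import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class IncludeBlockSketch {
    // Re-emit a sorted include block: user includes first, then sys includes,
    // separated by exactly one blank line when both groups are non-empty.
    static List<String> emit(SortedSet<String> userIncludes, SortedSet<String> sysIncludes) {
        List<String> result = new ArrayList<>(userIncludes);
        if (!userIncludes.isEmpty() && !sysIncludes.isEmpty()) {
            result.add("");
        }
        result.addAll(sysIncludes);
        return result;
    }

    public static void main(String[] args) {
        SortedSet<String> user = new TreeSet<>(List.of("#include \"b.hpp\"", "#include \"a.hpp\""));
        SortedSet<String> sys = new TreeSet<>(List.of("#include <new>"));
        emit(user, sys).forEach(System.out::println);
    }
}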
PR Review: https://git.openjdk.org/jdk/pull/24250#pullrequestreview-2718342841 From vlivanov at openjdk.org Wed Mar 26 19:13:12 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 26 Mar 2025 19:13:12 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support In-Reply-To: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Wed, 26 Mar 2025 10:11:25 GMT, Aleksey Shipilev wrote: > C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. > > For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. > > For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24250#pullrequestreview-2718363625 From kbarrett at openjdk.org Thu Mar 27 06:14:21 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 27 Mar 2025 06:14:21 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files Changes requested by kbarrett (Reviewer). src/hotspot/share/ci/ciUtilities.inline.hpp line 29: > 27: > 28: #include "ci/ciUtilities.hpp" > 29: Extra blank line not removed? src/hotspot/share/ci/ciUtilities.inline.hpp line 32: > 30: #include "runtime/interfaceSupport.inline.hpp" > 31: > 32: Extra blank line inserted? src/hotspot/share/compiler/compilationFailureInfo.cpp line 35: > 33: #include "compiler/compilationFailureInfo.hpp" > 34: #include "compiler/compileTask.hpp" > 35: #ifdef COMPILER2 Conditional includes are supposed to follow unconditional in a section. Out of scope for this PR? src/hotspot/share/compiler/disassembler.hpp line 36: > 34: #include "utilities/macros.hpp" > 35: > 36: Extra blank line inserted? test/hotspot/jtreg/sources/SortIncludes.java line 39: > 37: > 38: public class SortIncludes { > 39: private static final String INCLUDE_LINE = "^ *#include *(<[^>]+>|\"[^\"]+\") *$\\n"; There are files that have spaces between the `#` and `include`. I'm kind of inclined to suggest we fix those at some point (not in this PR). But the regex here needs to allow for that possibility, and perhaps (eventually) complain about such. test/hotspot/jtreg/sources/SortIncludes.java line 115: > 113: } > 114: > 115: /// Processes the C++ source file in `path` to sort its include statements. If we want to apply this to hotspot jtreg test code, then C source files also come into the picture. test/hotspot/jtreg/sources/SortIncludes.java line 153: > 151: > 152: /// Processes the C++ source files in `paths` to check if their include statements are sorted. > 153: /// Include statements with any non-space characters after the closing `"` or `>` will not Perhaps this should be mentioned in the style guide? 
------------- PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2719852021 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015721384 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015718606 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015723999 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015725371 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015706803 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015712545 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015714360 From kbarrett at openjdk.org Thu Mar 27 06:30:09 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 27 Mar 2025 06:30:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files Probably we want to eventually apply this to gtests, but there might be additional rules there. The include of unittest.hpp is (usually) last, and there may be (or may have been) a technical reason for that. Applying it to jtreg test support files could also introduce some challenges. Or at least discover a lot of non-conforming files. We might eventually want a mechanism for excluding directories, in addition to an inclusion list (that might eventually be "all"). These kinds of things can be followups once we have the basic mechanism in place. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2756881833 From bkilambi at openjdk.org Thu Mar 27 08:13:11 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 27 Mar 2025 08:13:11 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Tue, 25 Feb 2025 19:45:31 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Hello @shqking @theRealAph , sincere apologies for the delay in addressing the review comments. I am planning on uploading a patch soon addressing all review comments. Thank you ! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23748#issuecomment-2757083553 From dnsimon at openjdk.org Thu Mar 27 08:21:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 08:21:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 05:56:55 GMT, Kim Barrett wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> drop extra blank lines and preserve rule for first include in .inline.hpp files > > test/hotspot/jtreg/sources/SortIncludes.java line 39: > >> 37: >> 38: public class SortIncludes { >> 39: private static final String INCLUDE_LINE = "^ *#include *(<[^>]+>|\"[^\"]+\") *$\\n"; > > There are files that have spaces between the `#` and `include`. I'm kind of inclined to suggest we fix those > at some point (not in this PR). But the regex here needs to allow for that possibility, and perhaps (eventually) > complain about such. Since there are no such cases in the files processed in this PR, I'd suggest not adding support for them. They can be fixed in follow up PRs as the relevant directories are added to `TestIncludesAreSorted.HOTSPOT_SOURCES_TO_CHECK`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015912061 From shade at openjdk.org Thu Mar 27 08:45:49 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 27 Mar 2025 08:45:49 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support [v2] In-Reply-To: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: > C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. > > For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. > > For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. 
> > Additional testing: > - [x] Linux x86_64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Minor leftover ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24250/files - new: https://git.openjdk.org/jdk/pull/24250/files/376c5ad8..88e4589c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24250&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24250&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24250.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24250/head:pull/24250 PR: https://git.openjdk.org/jdk/pull/24250 From stefank at openjdk.org Thu Mar 27 08:46:09 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 08:46:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files

I ran the latest script over the HotSpot source and see that it messes up corner cases with our platform includes.

diff --git a/src/hotspot/cpu/aarch64/continuationEntry_aarch64.inline.hpp b/src/hotspot/cpu/aarch64/continuationEntry_aarch64.inline.hpp
index df4d3957239..e8816767a96 100644
--- a/src/hotspot/cpu/aarch64/continuationEntry_aarch64.inline.hpp
+++ b/src/hotspot/cpu/aarch64/continuationEntry_aarch64.inline.hpp
@@ -25,10 +25,9 @@
 #ifndef CPU_AARCH64_CONTINUATIONENTRY_AARCH64_INLINE_HPP
 #define CPU_AARCH64_CONTINUATIONENTRY_AARCH64_INLINE_HPP

-#include "runtime/continuationEntry.hpp"
-
 #include "code/codeCache.hpp"
 #include "oops/method.inline.hpp"
+#include "runtime/continuationEntry.hpp"
 #include "runtime/frame.inline.hpp"
 #include "runtime/registerMap.hpp"

The includes are:

.hpp --------------> _aarch64.hpp
 ^  ^
 |  |
 |  +------------------+
 |                     |
.inline.hpp -------> _aarch64.inline.hpp

So, continuationEntry.hpp acts like the .hpp file for continuationEntry_aarch64.inline.hpp. Unfortunately, we don't have a fully consistent way to write our platform includes, so I don't know how to codify this in a tool without breaking things.

------------- PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2720267338 From stefank at openjdk.org Thu Mar 27 08:46:10 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 08:46:10 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: <0I5RGRwY9sT2TJDoc1RjzTOck5evkm4-iO2Int7Imqg=.d3d3abce-e771-455f-9de6-cae4781434a1@github.com> On Thu, 27 Mar 2025 06:10:37 GMT, Kim Barrett wrote: >> Doug Simon has updated the pull request incrementally with one additional commit since the last revision: >> >> drop extra blank lines and preserve rule for first include in .inline.hpp files

> src/hotspot/share/compiler/disassembler.hpp line 36:
>
>> 34: #include "utilities/macros.hpp"
>> 35:
>> 36:
>
> Extra blank line inserted?

This seems to be leftovers from an earlier run. If I run the tool on this file it doesn't add this blank line.
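To illustrate the convention under discussion with a concrete, hypothetical example: in a platform `.inline.hpp` file the corresponding shared `.hpp` is expected to stay first, and — assuming the tool keeps ignoring include lines that carry trailing characters, which is what the workaround mentioned later in this thread relies on — a trailing comment can pin it there:

#include "runtime/continuationEntry.hpp" // must stay first; not re-ordered by SortIncludes

#include "code/codeCache.hpp"
#include "oops/method.inline.hpp"
#include "runtime/frame.inline.hpp"
#include "runtime/registerMap.hpp"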
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2015915647 From kbarrett at openjdk.org Thu Mar 27 09:07:22 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 27 Mar 2025 09:07:22 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 08:18:58 GMT, Doug Simon wrote: >> test/hotspot/jtreg/sources/SortIncludes.java line 39: >> >>> 37: >>> 38: public class SortIncludes { >>> 39: private static final String INCLUDE_LINE = "^ *#include *(<[^>]+>|\"[^\"]+\") *$\\n"; >> >> There are files that have spaces between the `#` and `include`. I'm kind of inclined to suggest we fix those >> at some point (not in this PR). But the regex here needs to allow for that possibility, and perhaps (eventually) >> complain about such. > > Since there are no such cases in the files processed in this PR, I'd suggest not adding support for them. They can be fixed in follow up PRs as the relevant directories are added to `TestIncludesAreSorted.HOTSPOT_SOURCES_TO_CHECK`. The regex needs to detect that case eventually anyway, so I think it should be done now. Either we allow that case, in which case the regex must match to work properly where they are present. Or we forbid that case, in which case the regex must match to detect future mistakes even after we've cleaned up existing usage. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016008497 From stefank at openjdk.org Thu Mar 27 09:17:09 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 09:17:09 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: <1W8bUhsbNfCXWzdT6QxlegrTNqYo-wxbQHhpzifIFK4=.71382d20-0999-4385-b285-e34936be436c@github.com> On Wed, 26 Mar 2025 14:23:09 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > drop extra blank lines and preserve rule for first include in .inline.hpp files I verified that adding a comment to the end of the `#include "runtime/continuationEntry.hpp"` line leaves that file intact, so I think that is a good enough workaround for the problematic platform includes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757283373 From stefank at openjdk.org Thu Mar 27 09:24:17 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 09:24:17 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 09:04:45 GMT, Kim Barrett wrote: >> Since there are no such cases in the files processed in this PR, I'd suggest not adding support for them. They can be fixed in follow up PRs as the relevant directories are added to `TestIncludesAreSorted.HOTSPOT_SOURCES_TO_CHECK`. > > The regex needs to detect that case eventually anyway, so I think it should be done now. Either we allow that > case, in which case the regex must match to work properly where they are present. Or we forbid that case, > in which case the regex must match to detect future mistakes even after we've cleaned up existing usage. To me it seems like a small adjustment fixes this Suggestion: private static final String INCLUDE_LINE = "^ *# *include *(<[^>]+>|"[^"]+") *$\\n"; ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016040674 From dnsimon at openjdk.org Thu Mar 27 09:43:08 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 09:43:08 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 06:26:43 GMT, Kim Barrett wrote: > Probably we want to eventually apply this to gtests, but there might be additional rules there. The include of unittest.hpp is (usually) last, and there may be (or may have been) a technical reason for that. > > Applying it to jtreg test support files could also introduce some challenges. Or at least discover a lot of non-conforming files. 
We might eventually want a mechanism for excluding directories, in addition to an inclusion list (that might eventually be "all"). > > These kinds of things can be followups once we have the basic mechanism in place. I would suggest someone open issue(s) for follow up enhancements to the tool. I think having something in place now and incrementally improving it and adjusting it for all the special cases makes most sense. > src/hotspot/share/compiler/compilationFailureInfo.cpp line 35: > >> 33: #include "compiler/compilationFailureInfo.hpp" >> 34: #include "compiler/compileTask.hpp" >> 35: #ifdef COMPILER2 > > Conditional includes are supposed to follow unconditional in a section. > Out of scope for this PR? Yep. From the PR description: The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > test/hotspot/jtreg/sources/SortIncludes.java line 115: > >> 113: } >> 114: >> 115: /// Processes the C++ source file in `path` to sort its include statements. > > If we want to apply this to hotspot jtreg test code, then C source files also come into the picture. I think the tool will need to be updated to handle C source files. At that point, the comment should be generalized. > test/hotspot/jtreg/sources/SortIncludes.java line 153: > >> 151: >> 152: /// Processes the C++ source files in `paths` to check if their include statements are sorted. >> 153: /// Include statements with any non-space characters after the closing `"` or `>` will not > > Perhaps this should be mentioned in the style guide? Done. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757350491 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016078724 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016077938 PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016078194 From dnsimon at openjdk.org Thu Mar 27 09:49:38 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 09:49:38 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. 
I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... Doug Simon has updated the pull request incrementally with four additional commits since the last revision: - allow spaces between `#` and `include` - moved some logic out of SortIncludes into TestIncludesAreSorted - removed extra blank lines - update style guide with advice on how to label includes that should not be re-ordered ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/18e2a1d6..cada0df4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=01-02 Stats: 117 lines in 6 files changed: 60 ins; 29 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From dnsimon at openjdk.org Thu Mar 27 09:49:38 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 09:49:38 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v2] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Thu, 27 Mar 2025 09:20:58 GMT, Stefan Karlsson wrote: >> The regex needs to detect that case eventually anyway, so I think it should be done now. Either we allow that >> case, in which case the regex must match to work properly where they are present. Or we forbid that case, >> in which case the regex must match to detect future mistakes even after we've cleaned up existing usage. > > To me it seems like a small adjustment fixes this > Suggestion: > > private static final String INCLUDE_LINE = "^ *# *include *(<[^>]+>|"[^"]+") *$\\n"; Done. 
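For reference, a minimal, self-contained sketch of what the adjusted pattern accepts and rejects. The class name and test strings are made up for illustration; the pattern is the suggested one with Java string escaping restored, and the trailing `\n` is dropped here because single lines are matched:

import java.util.regex.Pattern;

public class IncludeLineCheck {
    // The suggested pattern, with the quotes escaped for a Java string literal.
    private static final Pattern INCLUDE_LINE =
            Pattern.compile("^ *# *include *(<[^>]+>|\"[^\"]+\") *$");

    public static void main(String[] args) {
        // Both spellings now match; a line with a trailing comment does not,
        // so such includes are left where they are.
        System.out.println(INCLUDE_LINE.matcher("#include \"runtime/frame.hpp\"").matches());   // true
        System.out.println(INCLUDE_LINE.matcher("#  include <new>").matches());                  // true
        System.out.println(INCLUDE_LINE.matcher("#include \"a.hpp\" // keep first").matches()); // false
    }
}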
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24247#discussion_r2016091219 From stefank at openjdk.org Thu Mar 27 10:13:15 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 10:13:15 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 09:49:38 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... 
> > Doug Simon has updated the pull request incrementally with four additional commits since the last revision: > > - allow spaces between `#` and `include` > - moved some logic out of SortIncludes into TestIncludesAreSorted > - removed extra blank lines > - update style guide with advice on how to label includes that should not be re-ordered I'm happy with the capabilities of the tool now and think that it is good enough to include and promote to HotSpot devs. One questions is where to put the tool? I don't think the test directory is the best place. Maybe somewhere in `src/utils/`. There is a tools dir here `src/utils/src/build/tools/` but I don't know if it is appropriate to put it there. Maybe @magicus knows a good place for this? A couple of nits: 1) jcheck fails because of whitespaces 2) The /// style comments is a style I haven't encountered before. ------------- PR Review: https://git.openjdk.org/jdk/pull/24247#pullrequestreview-2720671629 From dnsimon at openjdk.org Thu Mar 27 10:39:13 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 10:39:13 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 10:10:07 GMT, Stefan Karlsson wrote: > A couple of nits: > > 1. jcheck fails because of whitespaces > 2. The /// style comments is a style I haven't encountered before. I fixed the whitespaces. I can convert the `///` comments if you want - no strong opinion. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757548404 From dnsimon at openjdk.org Thu Mar 27 10:47:14 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 10:47:14 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 09:49:38 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. 
>> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > Doug Simon has updated the pull request incrementally with four additional commits since the last revision: > - allow spaces between `#` and `include` - moved some logic out of SortIncludes into TestIncludesAreSorted - removed extra blank lines - update style guide with advice on how to label includes that should not be re-ordered

I just noticed that TestIncludesAreSorted is not run by GHA. How about we move `test/hotspot/jtreg/sources` into `tier1_common`:

diff --git a/test/hotspot/jtreg/TEST.groups b/test/hotspot/jtreg/TEST.groups
index 71b9e497e25..62b11e73aa0 100644
--- a/test/hotspot/jtreg/TEST.groups
+++ b/test/hotspot/jtreg/TEST.groups
@@ -139,6 +139,7 @@ serviceability_ttf_virtual = \
   -serviceability/jvmti/negative

 tier1_common = \
+  sources \
   sanity/BasicVMTest.java \
   gtest/GTestWrapper.java \
   gtest/LockStackGtests.java \
@@ -619,16 +620,12 @@ tier1_serviceability = \
   -serviceability/sa/TestJmapCore.java \
   -serviceability/sa/TestJmapCoreMetaspace.java

-tier1_sources = \
-  sources
-
 tier1 = \
   :tier1_common \
   :tier1_compiler \
   :tier1_gc \
   :tier1_runtime \
   :tier1_serviceability \
-  :tier1_sources

 tier2 = \
   :hotspot_tier2_runtime \

------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757570734 From stefank at openjdk.org Thu Mar 27 11:16:07 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 27 Mar 2025 11:16:07 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 10:36:38 GMT, Doug Simon wrote: > > A couple of nits: > > > > 1. jcheck fails because of whitespaces > > 2. The /// style comments is a style I haven't encountered before. > > I fixed the whitespaces. I can convert the `///` comments if you want - no strong opinion. Maybe someone else knows the preferred style for this? I don't think we need to block the integration because of this. If someone comes late with the proper comment style, we'll update it in a separate PR.
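For readers who have not seen them: `///` is the Markdown documentation comment syntax (JEP 467). A small, hypothetical example in the spirit of the tool's comments — the class, method and stream pipeline below are illustrative only, not the tool's actual implementation, which uses a SortedSet as described above:

import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

public class MarkdownDocCommentExample {

    /// Sorts include lines using lowercased keys, so that `_` sorts before
    /// letters, and drops exact duplicates.
    static List<String> sortIncludes(Stream<String> lines) {
        return lines.distinct()
                    .sorted(Comparator.comparing((String s) -> s.toLowerCase(Locale.ROOT)))
                    .toList();
    }

    public static void main(String[] args) {
        // Prints g1_globals.hpp before g1Allocator.hpp, matching the sort order
        // described in the style guide update.
        sortIncludes(Stream.of(
                "#include \"gc/g1/g1Allocator.hpp\"",
                "#include \"gc/g1/g1_globals.hpp\"",
                "#include \"gc/g1/g1Allocator.hpp\""))
            .forEach(System.out::println);
    }
}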
------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2757648203 From duke at openjdk.org Thu Mar 27 12:40:29 2025 From: duke at openjdk.org (Zihao Lin) Date: Thu, 27 Mar 2025 12:40:29 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make [v3] In-Reply-To: References: Message-ID: > This patch remove slice parameter from LoadNode::make > > Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 > > Hi team, I am new, I'd appreciate any guidance. Thank a lot! Zihao Lin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'openjdk:master' into 8344116 - Merge branch 'openjdk:master' into 8344116 - 8344116: C2: remove slice parameter from LoadNode::make ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24258/files - new: https://git.openjdk.org/jdk/pull/24258/files/f4ef46dc..08c1a382 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=01-02 Stats: 3892 lines in 94 files changed: 1545 ins; 2033 del; 314 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From dnsimon at openjdk.org Thu Mar 27 13:21:55 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 13:21:55 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v4] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. 
For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... Doug Simon has updated the pull request incrementally with two additional commits since the last revision: - moved test/hotspot/jtreg/sources into tier1_common - remove trailing spaces ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/cada0df4..93770e71 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=02-03 Stats: 7 lines in 2 files changed: 1 ins; 4 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From ihse at openjdk.org Thu Mar 27 13:38:19 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 27 Mar 2025 13:38:19 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: <8Fj5Ui2g5uviJDr4x5rqLaoODjxjfxY_SWqbxXojSlI=.47630310-20d6-40e3-aab9-bce915bc04ad@github.com> On Thu, 27 Mar 2025 10:10:07 GMT, Stefan Karlsson wrote: > One questions is where to put the tool? I don't think the test directory is the best place. Maybe somewhere in src/utils/. There is a tools dir here src/utils/src/build/tools/ but I don't know if it is appropriate to put it there. Maybe @magicus knows a good place for this? I would actually recommend just the `bin` directory. This is , after all, intended to be run as a simple script (remember, it was originally a python script), in a similar vein to the already existing `blessed-modifier-order.sh` script. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2758060262 From ihse at openjdk.org Thu Mar 27 13:38:20 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 27 Mar 2025 13:38:20 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 11:13:27 GMT, Stefan Karlsson wrote: > The /// style comments is a style I haven't encountered before. 
This is for the new markdown comments. Personally, I very much prefer them and have been looking forward to these for a long time. But I don't know if we have any policy for or against those in the JDK. Using them in a script like this seems fine to me, at any rate. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2758064273 From dnsimon at openjdk.org Thu Mar 27 14:11:07 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 14:11:07 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v5] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... 
Doug Simon has updated the pull request incrementally with one additional commit since the last revision: moved error message into UnsortedIncludesException ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/93770e71..c93e6646 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=03-04 Stats: 13 lines in 2 files changed: 8 ins; 4 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From dnsimon at openjdk.org Thu Mar 27 14:11:07 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 14:11:07 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> Message-ID: On Thu, 27 Mar 2025 10:44:48 GMT, Doug Simon wrote: > I just noticed that TestIncludesAreSorted is not run by GHA. How about we move `test/hotspot/jtreg/sources` into `tier1_common`: I went ahead and pushed this change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2758187724 From dnsimon at openjdk.org Thu Mar 27 14:17:34 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 27 Mar 2025 14:17:34 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v3] In-Reply-To: <8Fj5Ui2g5uviJDr4x5rqLaoODjxjfxY_SWqbxXojSlI=.47630310-20d6-40e3-aab9-bce915bc04ad@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> <4eq6qUl0x1TJxdlM6oWmpAazGCmFsVbzSjY58KFosv0=.c2175ffa-cdc1-4754-a84b-30f0a389397c@github.com> <8Fj5Ui2g5uviJDr4x5rqLaoODjxjfxY_SWqbxXojSlI=.47630310-20d6-40e3-aab9-bce915bc04ad@github.com> Message-ID: On Thu, 27 Mar 2025 13:34:02 GMT, Magnus Ihse Bursie wrote: > I would actually recommend just the bin directory. Fine by me but I'm not sure how to then use `bin/SortIncludes.java` in `test/hotspot/jtreg/sources/TestIncludesAreSorted.java`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2758217497 From thartmann at openjdk.org Thu Mar 27 15:36:18 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 27 Mar 2025 15:36:18 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: References: Message-ID: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> On Wed, 26 Mar 2025 09:16:17 GMT, Marc Chevalier wrote: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. > > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. 
The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. > > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc @rwestrel Should have a look at this :) Please add an IR framework test that verifies that layout helper checks are optimized. src/hotspot/share/opto/type.cpp line 3684: > 3682: } > 3683: > 3684: bool TypeInterfaces::has_non_array_interface() const { What about using `TypeAryPtr::_array_interfaces->contains(_interfaces);` instead? ------------- Changes requested by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24245#pullrequestreview-2722219539 PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2016955402 From shade at openjdk.org Thu Mar 27 17:08:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 27 Mar 2025 17:08:44 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support [v2] In-Reply-To: References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: <02tkGNzza6MfOkCxeymt8tcXm3bSCPiv6GBCkwjcLs4=.4d351dd7-82b9-46c2-ada6-facf807f70a2@github.com> On Thu, 27 Mar 2025 08:45:49 GMT, Aleksey Shipilev wrote: >> C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. >> >> For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. >> >> For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Minor leftover Need a quick re-review after a minor leftover removal. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24250#issuecomment-2758809458 From vlivanov at openjdk.org Thu Mar 27 17:59:25 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 27 Mar 2025 17:59:25 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support [v2] In-Reply-To: References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Thu, 27 Mar 2025 08:45:49 GMT, Aleksey Shipilev wrote: >> C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. >> >> For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. 
>> >> For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Minor leftover Marked as reviewed by vlivanov (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/24250#pullrequestreview-2722925670 From shade at openjdk.org Thu Mar 27 18:14:34 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 27 Mar 2025 18:14:34 GMT Subject: RFR: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support [v2] In-Reply-To: References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Thu, 27 Mar 2025 08:45:49 GMT, Aleksey Shipilev wrote: >> C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. >> >> For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. >> >> For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Minor leftover Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/24250#issuecomment-2759003292 From shade at openjdk.org Thu Mar 27 18:14:34 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 27 Mar 2025 18:14:34 GMT Subject: Integrated: 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support In-Reply-To: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> References: <8zUrV-sMSOwRSQk_jERtFqjrzOFUP7rlUwTTN7cPP_8=.b1d30fb5-d9c0-4417-bacd-bf09f2af433b@github.com> Message-ID: On Wed, 26 Mar 2025 10:11:25 GMT, Aleksey Shipilev wrote: > C1 and C2 have support for rounding double/floats, to support awkward rounding modes of x87 FPU. With 32-bit x86 port removed, we can remove those parts. This basically deletes all the code that uses `strict_fp_requires_explicit_rounding`, which is now universally `false` for all supported platforms. > > For C1, we remove `RoundFP` op, its associated `lir_roundfp` and related utility methods that insert these nodes in the graph. > > For C2, we remove `RoundDouble` and `RoundFloat` nodes (note there is a confusingly named `RoundDoubleMode` nodes that are not related to this), associated utility methods, AD match rules that reference these nodes (as nops!), and some `Ideal`-s that are no longer needed. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` This pull request has now been integrated. 
Changeset: b73663a2 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/b73663a2b4fe7049fc0990c1a1e51221640b4e29 Stats: 547 lines in 48 files changed: 0 ins; 513 del; 34 mod 8351155: C1/C2: Remove 32-bit x86 specific FP rounding support Reviewed-by: vlivanov, kvn ------------- PR: https://git.openjdk.org/jdk/pull/24250 From mchevalier at openjdk.org Fri Mar 28 09:41:11 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 28 Mar 2025 09:41:11 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> References: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> Message-ID: On Thu, 27 Mar 2025 15:33:31 GMT, Tobias Hartmann wrote: >> If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. >> >> In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. >> >> This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. >> >> The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. >> >> Tested with tier1..3, hs-precheckin-comp and hs-comp-stress >> >> Thanks, >> Marc > > src/hotspot/share/opto/type.cpp line 3684: > >> 3682: } >> 3683: >> 3684: bool TypeInterfaces::has_non_array_interface() const { > > What about using `TypeAryPtr::_array_interfaces->contains(_interfaces);` instead? Almost! return !TypeAryPtr::_array_interfaces->contains(this); Contains is about TypeInterfaces, that is set of interfaces. So I just need to check that `this` is not a sub-set of array interfaces. That should do it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2018248760 From mchevalier at openjdk.org Fri Mar 28 09:53:13 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 28 Mar 2025 09:53:13 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: References: Message-ID: On Wed, 26 Mar 2025 09:16:17 GMT, Marc Chevalier wrote: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. > > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. 
The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. > > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc I'm not sure how to write such an IR test. I'm looking at [TestArrayGuardWithInterfaces.java](https://github.com/openjdk/jdk/blob/3e9a7a4aed168422473c941ff5626d0d65aaadfa/test/hotspot/jtreg/compiler/intrinsics/TestArrayGuardWithInterfaces.java). I see the graphs of `test1` before and after, and the new one is smaller. But the nodes used are pretty much the same, or they don't feel clearly linked to interface checking: there is `DecodeNKlass` or `AddP`, but it doesn't seem obvious without having the graph under the eyes that it actually checks something meaningful. There are also less `If` (2 instead of 3), but once again, the test seems brittle. I also see that There is no more `Return` only `Halt` since we can now prove the function cannot return normally. But on the graph of `test2` ends with two `Halt`: traps everywhere, even if there are paths on which `test2` doesn't throw. So the lack of `Return` doesn't sound very robust. Overall, not sure what a good test would be. I can write a test that would not pass before and pass now, but I'm not convinced they would reliably catch regression, and that they won't break for unrelated reasons. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24245#issuecomment-2760760131 From duke at openjdk.org Fri Mar 28 13:03:48 2025 From: duke at openjdk.org (Zihao Lin) Date: Fri, 28 Mar 2025 13:03:48 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make [v4] In-Reply-To: References: Message-ID: > This patch remove slice parameter from LoadNode::make > > Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 > > Hi team, I am new, I'd appreciate any guidance. Thank a lot! Zihao Lin has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: 8344116: C2: remove slice parameter from LoadNode::make ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24258/files - new: https://git.openjdk.org/jdk/pull/24258/files/08c1a382..f6b2fbec Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=02-03 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From thartmann at openjdk.org Fri Mar 28 19:27:15 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 28 Mar 2025 19:27:15 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: References: Message-ID: On Wed, 26 Mar 2025 09:16:17 GMT, Marc Chevalier wrote: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. 
> > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. > > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc Right, I was hoping that there would be some other suitable users of `GraphKit::get_layout_helper` that would now be folded but all current uses either trap or don't handle both arrays and non-arrays (and therefore wouldn't fold). So I agree, adding an IR framework test does not make sense. The existing test is sufficient. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24245#issuecomment-2762234661 From thartmann at openjdk.org Fri Mar 28 19:27:15 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 28 Mar 2025 19:27:15 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces In-Reply-To: References: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> Message-ID: On Fri, 28 Mar 2025 09:38:19 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/type.cpp line 3684: >> >>> 3682: } >>> 3683: >>> 3684: bool TypeInterfaces::has_non_array_interface() const { >> >> What about using `TypeAryPtr::_array_interfaces->contains(_interfaces);` instead? > > Almost! > > return !TypeAryPtr::_array_interfaces->contains(this); > > Contains is about TypeInterfaces, that is set of interfaces. So I just need to check that `this` is not a sub-set of array interfaces. That should do it. Now I'm confused, isn't this what I proposed? I didn't check the exact syntax, I just wondered if the `TypeInterfaces::contains` method couldn't be used instead of adding a new method. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2019219027 From vlivanov at openjdk.org Fri Mar 28 21:49:25 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 28 Mar 2025 21:49:25 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v2] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Wed, 26 Mar 2025 08:33:58 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 
22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Use builtin_throw > - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code > - More exhaustive bench > - Limit inlining of math Exact operations in case of too many deopts Thanks, Marc. It looks a bit too convoluted to me. IMO an unconditional call to `builtin_throw`, plus `too_many_traps` check should do the job. Do I miss something important here? src/hotspot/share/opto/graphKit.hpp line 279: > 277: // The JVMS must allow the bytecode to be re-executed via an uncommon trap. > 278: // If `exception_object` is nullptr, the exception to throw will be guessed based on `reason` > 279: void builtin_throw(Deoptimization::DeoptReason reason, ciInstance* exception_object = nullptr); Please, introduce a new overload instead. I suggest to extract Deoptimization::DeoptReason -> ciInstance mapping into a helper method and turn `void builtin_throw(Deoptimization::DeoptReason reason)` into a wrapper: void GraphKit::builtin_throw(Deoptimization::DeoptReason reason) { builtin_throw(reason, exception_on_deopt(reason)); } src/hotspot/share/opto/library_call.cpp line 2035: > 2033: > 2034: if (use_builtin_throw) { > 2035: builtin_throw(Deoptimization::Reason_intrinsic, env()->ArithmeticException_instance()); I suggest to unconditionally call `builtin_throw()`. It should handle `uncommon_trap` case as well. What makes sense is to ensure that `builtin_throw()` doesn't change deoptimization reason. It can be implemented with an extra argument to new `GraphKit::builtin_throw` overload (e.g., `bool allow_deopt_reason_none`). src/hotspot/share/opto/library_call.cpp line 2054: > 2052: // instead of bailing out on intrinsic or potentially deopting, let's do that! > 2053: use_builtin_throw = true; > 2054: } else if (too_many_traps(Deoptimization::Reason_intrinsic)) { Why `too_many_traps(Deoptimization::Reason_intrinsic)` check is not enough here? 
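To make the failure mode concrete, here is a minimal, self-contained Java sketch of the overflow-heavy pattern that the `*Overflow` benchmarks quoted above exercise. The class name, loop bound and random choice are illustrative only and not taken from the actual JMH benchmark; the point is that the `ArithmeticException` path of `Math.addExact` is hot, which is what repeatedly fires the intrinsic's uncommon trap before this fix.

```
import java.util.concurrent.ThreadLocalRandom;

// Illustrative workload only: Math.addExact overflows on roughly half of the
// iterations, so the ArithmeticException path is hot. Before the change
// discussed in this thread, the C2 intrinsic would hit its uncommon trap on
// that path over and over, causing the deoptimization/recompilation cycle.
public class AddExactOverflowLoop {
    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            // a is one of MAX_VALUE, MAX_VALUE-1, MAX_VALUE-2, MAX_VALUE-3
            int a = Integer.MAX_VALUE - ThreadLocalRandom.current().nextInt(4);
            try {
                sum += Math.addExact(a, 2); // overflows when a > Integer.MAX_VALUE - 2
            } catch (ArithmeticException e) {
                sum++; // exceptional path, taken about half the time
            }
        }
        System.out.println(sum);
    }
}
```

The same shape applies to the other `Math.*Exact` methods covered by the benchmark.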
------------- PR Review: https://git.openjdk.org/jdk/pull/23916#pullrequestreview-2726864135 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2019432922 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2019444895 PR Review Comment: https://git.openjdk.org/jdk/pull/23916#discussion_r2019449996 From dnsimon at openjdk.org Fri Mar 28 22:24:40 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 28 Mar 2025 22:24:40 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: > This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). > > By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. > > The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. > > I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. > > When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: > > java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: > > java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci > > at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1447) > Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: > > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp > /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Optim... 
Doug Simon has updated the pull request incrementally with one additional commit since the last revision: convert Windows path to Unix path ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24247/files - new: https://git.openjdk.org/jdk/pull/24247/files/c93e6646..921e3251 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24247&range=04-05 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24247.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24247/head:pull/24247 PR: https://git.openjdk.org/jdk/pull/24247 From duke at openjdk.org Sat Mar 29 07:19:21 2025 From: duke at openjdk.org (Zihao Lin) Date: Sat, 29 Mar 2025 07:19:21 GMT Subject: RFR: 8344116: C2: remove slice parameter from LoadNode::make [v5] In-Reply-To: References: Message-ID: > This patch remove slice parameter from LoadNode::make > > Mention in https://github.com/openjdk/jdk/pull/21834#pullrequestreview-2429164805 > > Hi team, I am new, I'd appreciate any guidance. Thank a lot! Zihao Lin has updated the pull request incrementally with two additional commits since the last revision: - Fix build - Fix test failed ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24258/files - new: https://git.openjdk.org/jdk/pull/24258/files/f6b2fbec..a1924c35 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24258&range=03-04 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24258.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24258/head:pull/24258 PR: https://git.openjdk.org/jdk/pull/24258 From mchevalier at openjdk.org Mon Mar 31 06:49:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 06:49:50 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References: Message-ID: > If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. > > In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. > > This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. > > The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. 
> > Tested with tier1..3, hs-precheckin-comp and hs-comp-stress > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: not reinventing the wheel ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24245/files - new: https://git.openjdk.org/jdk/pull/24245/files/a77c397c..daaaf9ae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24245&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24245&range=00-01 Stats: 13 lines in 1 file changed: 0 ins; 12 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/24245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24245/head:pull/24245 PR: https://git.openjdk.org/jdk/pull/24245 From mchevalier at openjdk.org Mon Mar 31 06:49:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 06:49:50 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> Message-ID: <48D8vzTXZDKtZxAMTDdo9ggjWnWn7XNjs6rZqwuDZxc=.d833c90c-09da-4167-aec9-aba8b9e523b5@github.com> On Fri, 28 Mar 2025 19:24:22 GMT, Tobias Hartmann wrote: >> Almost! >> >> return !TypeAryPtr::_array_interfaces->contains(this); >> >> Contains is about TypeInterfaces, that is set of interfaces. So I just need to check that `this` is not a sub-set of array interfaces. That should do it. > > Now I'm confused, isn't this what I proposed? I didn't check the exact syntax, I just wondered if the `TypeInterfaces::contains` method couldn't be used instead of adding a new method. Yes, totally! It's just a detail difference. But there is another question: whether we still want `has_non_array_interface` has a wrapper for this call with a more explicit name, or if we simply inline your suggestion on the callsite of `has_non_array_interface`. I tend toward the first, I like explicit names, and I suspect it might be useful in more than one place, but not a strong opinion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2020483393 From mchevalier at openjdk.org Mon Mar 31 07:54:14 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 07:54:14 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v2] In-Reply-To: References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: On Wed, 26 Mar 2025 08:33:58 GMT, Marc Chevalier wrote: >> `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. >> This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. >> >> Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. >> >> tl;dr: >> - C1: no problem, no change >> - C2: >> - with intrinsics: >> - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) >> - without overflow: no problem, no change >> - without intrinsics: no problem, no change >> >> Before the fix: >> >> Benchmark (SIZE) Mode Cnt Score Error Units >> MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op >> MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op >> MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op >> MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 
229.425 ms/op >> MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op >> MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op >> MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op >> MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op >> MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op >> MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op >> MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op >> MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op >> MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op >> MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op >> MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op >> MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op >> MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op >> MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op >> MathExact.C1_1.loop... > > Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Use builtin_throw > - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code > - More exhaustive bench > - Limit inlining of math Exact operations in case of too many deopts Actually, yes, there is a reason I've made it so weird (and I agree it's pretty convoluted). `builtin_throw` kicks in if `too_many_traps(reason)` is true (and another case, but it might not apply): https://github.com/openjdk/jdk/blob/59629f88e6fad9c1ff91be4cfea83f78f0ea503c/src/hotspot/share/opto/graphKit.cpp#L540-L555 If `treat_throw_as_hot` is false (so before too many traps) it just ends up as a `uncommon_trap` with `Action_maybe_recompile` action. That is fine at first. But later, we would like `builtin_throw` to do its job, but it can only do if if https://github.com/openjdk/jdk/blob/59629f88e6fad9c1ff91be4cfea83f78f0ea503c/src/hotspot/share/opto/graphKit.cpp#L563 which is not `too_many_traps(reason)`. Which means that: - if we don't bailout intrinsics on `too_many_traps(reason)` we may be in the same situation as in the bug, with deopt cycles, in the situation where `builtin_throw` doesn't do it's job (for instance `method()->can_omit_stack_trace()` is false) - if we bailout intrincs on `too_many_traps(reason)`, then `builtin_throw` never get a hot enough throw that it can speed up, and we have the same situation as my first version, before you suggested `builtin_throw` (with performances similar for C2 and C1). In other words, we need `too_many_traps(reason)` to be reached to have `builtin_throw` start to have a change to do something, but it might not, and in this case, we need to bailout from intrinsics otherwise, we will repeatedly deopt. So, when `too_many_traps(reason)` is true, we have two options: either we give it to `builtin_throw` or we bailout. And to avoid the deopt cycles, we must know in advance if `builtin_throw` will do its job or just default to an `uncommon_trap` again (in which case, bailing out is better). This is why I extracted the condition for `builtin_throw` into `builtin_throw_applies`: so that intrinsic can decide what is best to do. Some of your suggestions are still relevant tho! 
I'll apply them. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23916#issuecomment-2765414288 From mchevalier at openjdk.org Mon Mar 31 08:05:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 08:05:50 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v3] In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... 
Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: guess_exception_from_deopt_reason out of builtin_throw ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23916/files - new: https://git.openjdk.org/jdk/pull/23916/files/9372228d..41d7a1d4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=01-02 Stats: 49 lines in 2 files changed: 21 ins; 25 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23916.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916 PR: https://git.openjdk.org/jdk/pull/23916 From mchevalier at openjdk.org Mon Mar 31 08:33:42 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 08:33:42 GMT Subject: RFR: 8346989: C2: deoptimization and re-compilation cycle with Math.*Exact in case of frequent overflow [v4] In-Reply-To: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> References: <8ACplaVM_gN9cbIcQYGJmR4GNINm70PAJQ8uAgucK4Y=.14fdc7e2-e0af-4f0d-acb6-bcfe99ee8f36@github.com> Message-ID: > `Math.*Exact` intrinsics can cause many deopt when used repeatedly with problematic arguments. > This fix proposes not to rely on intrinsics after `too_many_traps()` has been reached. > > Benchmark show that this issue affects every Math.*Exact functions. And this fix improve them all. > > tl;dr: > - C1: no problem, no change > - C2: > - with intrinsics: > - with overflow: clear improvement. Was way worse than C1, now is similar (~4s => ~600ms) > - without overflow: no problem, no change > - without intrinsics: no problem, no change > > Before the fix: > > Benchmark (SIZE) Mode Cnt Score Error Units > MathExact.C1_1.loopAddIInBounds 1000000 avgt 3 1.272 ? 0.048 ms/op > MathExact.C1_1.loopAddIOverflow 1000000 avgt 3 641.917 ? 58.238 ms/op > MathExact.C1_1.loopAddLInBounds 1000000 avgt 3 1.402 ? 0.842 ms/op > MathExact.C1_1.loopAddLOverflow 1000000 avgt 3 671.013 ? 229.425 ms/op > MathExact.C1_1.loopDecrementIInBounds 1000000 avgt 3 3.722 ? 22.244 ms/op > MathExact.C1_1.loopDecrementIOverflow 1000000 avgt 3 653.341 ? 279.003 ms/op > MathExact.C1_1.loopDecrementLInBounds 1000000 avgt 3 2.525 ? 0.810 ms/op > MathExact.C1_1.loopDecrementLOverflow 1000000 avgt 3 656.750 ? 141.792 ms/op > MathExact.C1_1.loopIncrementIInBounds 1000000 avgt 3 4.621 ? 12.822 ms/op > MathExact.C1_1.loopIncrementIOverflow 1000000 avgt 3 651.608 ? 274.396 ms/op > MathExact.C1_1.loopIncrementLInBounds 1000000 avgt 3 2.576 ? 3.316 ms/op > MathExact.C1_1.loopIncrementLOverflow 1000000 avgt 3 662.216 ? 71.879 ms/op > MathExact.C1_1.loopMultiplyIInBounds 1000000 avgt 3 1.402 ? 0.587 ms/op > MathExact.C1_1.loopMultiplyIOverflow 1000000 avgt 3 615.836 ? 252.137 ms/op > MathExact.C1_1.loopMultiplyLInBounds 1000000 avgt 3 2.906 ? 5.718 ms/op > MathExact.C1_1.loopMultiplyLOverflow 1000000 avgt 3 655.576 ? 147.432 ms/op > MathExact.C1_1.loopNegateIInBounds 1000000 avgt 3 2.023 ? 0.027 ms/op > MathExact.C1_1.loopNegateIOverflow 1000000 avgt 3 639.136 ? 30.841 ms/op > MathExact.C1_1.loopNegateLInBounds 1000000 avgt 3 2.422 ? 3.59... Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains six commits: - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code - guess_exception_from_deopt_reason out of builtin_throw - Use builtin_throw - Merge branch 'master' into fix/Deoptimization-and-re-compilation-cycle-with-C2-compiled-code - More exhaustive bench - Limit inlining of math Exact operations in case of too many deopts ------------- Changes: https://git.openjdk.org/jdk/pull/23916/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23916&range=03 Stats: 759 lines in 6 files changed: 723 ins; 27 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23916.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23916/head:pull/23916 PR: https://git.openjdk.org/jdk/pull/23916 From mchevalier at openjdk.org Mon Mar 31 09:37:08 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 31 Mar 2025 09:37:08 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: <48D8vzTXZDKtZxAMTDdo9ggjWnWn7XNjs6rZqwuDZxc=.d833c90c-09da-4167-aec9-aba8b9e523b5@github.com> References: <-Ri4lJUzCkI9yLG-kGwTGeAhd453SDgt_qvoB1iw4_A=.f3e126ab-a4ff-4f7f-80a7-c6e739cc6727@github.com> <48D8vzTXZDKtZxAMTDdo9ggjWnWn7XNjs6rZqwuDZxc=.d833c90c-09da-4167-aec9-aba8b9e523b5@github.com> Message-ID: On Mon, 31 Mar 2025 06:46:51 GMT, Marc Chevalier wrote: >> Now I'm confused, isn't this what I proposed? I didn't check the exact syntax, I just wondered if the `TypeInterfaces::contains` method couldn't be used instead of adding a new method. > > Yes, totally! It's just a detail difference. But there is another question: whether we still want `has_non_array_interface` has a wrapper for this call with a more explicit name, or if we simply inline your suggestion on the callsite of `has_non_array_interface`. I tend toward the first, I like explicit names, and I suspect it might be useful in more than one place, but not a strong opinion. For now, I just replaced the implementation of `has_non_array_interface`. If one feels even keeping the method is premature factorization, I can easily inline it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2020704570 From bkilambi at openjdk.org Mon Mar 31 09:54:14 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 31 Mar 2025 09:54:14 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: <5_o8l6NUDH-laA-OZT9wvJ5-AR9vs2tUwXf0jVzB9T4=.0ec06331-95ca-45a2-bd1f-14cea2150b81@github.com> On Tue, 25 Feb 2025 19:45:31 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Hello, I would not be able to respond to comments until the next couple months or so due to some urgent tasks at work. Until then, I'd move this PR to draft status so that it would not be closed due to lack of activity. Thank you for the review! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23748#issuecomment-2765729618 From ihse at openjdk.org Mon Mar 31 10:05:19 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 31 Mar 2025 10:05:19 GMT Subject: RFR: 8352645: Add tool support to check order of includes [v6] In-Reply-To: References: <2R1Lazv-rFiErR_ZtJjyT77Wm2XeaKQ8hA5HDg8o1v4=.084054ad-a46b-4206-bc1e-5e9d2bdbaaa2@github.com> Message-ID: On Fri, 28 Mar 2025 22:24:40 GMT, Doug Simon wrote: >> This PR adds `test/hotspot/jtreg/sources/SortIncludes.java`, a tool to check that blocks of include statements in C++ files are sorted and that there's at least one blank line between user and sys includes (as per the [style guide](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md#source-files)). >> >> By virtue of using `SortedSet`, the tool also removes duplicate includes (e.g. `"compiler/compilerDirectives.hpp"` on line [37](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L37) and line [41](https://github.com/openjdk/jdk/blob/059f190f4b0c7836b89ca2070400529e8d33790b/src/hotspot/share/c1/c1_Compilation.cpp#L41)). Sorting uses lowercased strings so that `_` sorts before letters, preserving the prevailing convention in the code base. I've also updated the style guide to clarify this sort-order. >> >> The tool does nothing about re-ordering blocks of conditional includes vs unconditional includes. I briefly looked into that but it gets very complicated, very quickly. That kind of re-ordering will have to continue to be done manually for now. >> >> I have used the tool to fix the ordering of a subset of HotSpot sources and added a test to keep them sorted. That test can be expanded over time to keep includes sorted in other HotSpot directories. >> >> When `TestIncludesAreSorted.java` fails, it tries to provide actionable advice. For example: >> >> java.lang.RuntimeException: The unsorted includes listed below should be fixable by running: >> >> java /Users/dnsimon/dev/jdk-jdk/open/test/hotspot/jtreg/sources/SortIncludes.java --update /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1 /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/ci /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/compiler /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/jvmci >> >> at TestIncludesAreSorted.main(TestIncludesAreSorted.java:80) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:565) >> at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) >> at java.base/java.lang.Thread.run(Thread.java:1447) >> Caused by: java.lang.RuntimeException: 36 files with unsorted headers found: >> >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Compilation.cpp >> /Users/dnsimon/dev/jdk-jdk/open/src/hotspot/share/c1/c1_Runtime1.cpp >> /Users/dnsimo... > > Doug Simon has updated the pull request incrementally with one additional commit since the last revision: > > convert Windows path to Unix path Hm... I know the source code is bundled with the test image, but I'm not 100% sure if it just includes `src`, or if the entire top-level source is included. I'll need to check that, including what is the best way to get a proper reference to the top-level directory from a test. 
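One possible approach, sketched below under the assumption that the jtreg-provided `test.src` system property points into the checked-out test sources (this is not the actual `TestIncludesAreSorted` code): walk upwards until a directory containing `src/hotspot` is found, and skip gracefully if the sources are not there.

```
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only (not the actual TestIncludesAreSorted logic): walk up from the
// directory jtreg passes in the "test.src" property until we find a directory
// that contains src/hotspot, i.e. the top of the checked-out sources.
public class FindRepoRoot {
    public static void main(String[] args) {
        Path dir = Path.of(System.getProperty("test.src", ".")).toAbsolutePath();
        while (dir != null && !Files.isDirectory(dir.resolve("src").resolve("hotspot"))) {
            dir = dir.getParent();
        }
        if (dir == null) {
            // e.g. running from a test image that does not bundle the sources
            System.out.println("HotSpot sources not found; skipping");
        } else {
            System.out.println("repository root: " + dir);
        }
    }
}
```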
------------- PR Comment: https://git.openjdk.org/jdk/pull/24247#issuecomment-2765754142 From duke at openjdk.org Mon Mar 31 11:14:14 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 11:14:14 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: On Mon, 24 Mar 2025 02:38:37 GMT, Jatin Bhateja wrote: >> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: >> >> - Further readability improvements. >> - Added asserts for array sizes > > src/hotspot/cpu/x86/vm_version_x86.cpp line 1252: > >> 1250: // Currently we only have them for AVX512 >> 1251: #ifdef _LP64 >> 1252: if (supports_evex() && supports_avx512bw()) { > > supports_evex check looks redundant. These are checks for two different feature bits: CPU_AVX512F and CPU_AVX512BW. Are you saying that the latter implies the former in every implementation of the spec? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2020853815 From roland at openjdk.org Mon Mar 31 11:49:09 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 31 Mar 2025 11:49:09 GMT Subject: RFR: 8348853: Fold layout helper check for objects implementing non-array interfaces [v2] In-Reply-To: References: Message-ID: On Mon, 31 Mar 2025 06:49:50 GMT, Marc Chevalier wrote: >> If `TypeInstKlassPtr` represents an array type, it has to be `java.lang.Object`. From contraposition, if it is not `java.lang.Object`, we can conclude it is not an array, and we can skip some array checks, for instance. >> >> In this PR, we improve this deduction with an interface base reasoning: arrays implements only Cloneable and Serializable, so if a type implements anything else, it cannot be an array. >> >> This change partially reverts the changes from [JDK-8348631](https://bugs.openjdk.org/browse/JDK-8348631) (#23331) (in `LibraryCallKit::generate_array_guard_common`) and the test still passes. >> >> The way interfaces are check might be done differently. The current situation is a balance between visibility (not to leak too much things explicitly private), having not overly general methods for one use-case and avoiding too concrete (and brittle) interfaces. >> >> Tested with tier1..3, hs-precheckin-comp and hs-comp-stress >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > not reinventing the wheel src/hotspot/share/opto/memnode.cpp line 2214: > 2212: if (tkls->offset() == in_bytes(Klass::layout_helper_offset()) && > 2213: tkls->isa_instklassptr() && // not directly typed as an array > 2214: !tkls->is_instklassptr()->might_be_an_array() // not the supertype of all T[] (java.lang.Object) or has an interface that is not Serializable or Cloneable Could we do the same by using `TypeKlassPtr::maybe_java_subtype_of(TypeAryKlassPtr::BOTTOM)` and define a `TypeAryKlassPtr::BOTTOM` to be a static field for the `array_interfaces`? AFAICT, `TypeKlassPtr::maybe_java_subtype_of()` already covers that case so it would avoid some logic duplication. Also in the test above, maybe you could simplify the test a little but by removing `tkls->isa_instklassptr()`? 
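As a stand-alone illustration of the type-system fact behind this change (illustration only, not part of the patch): array types implement exactly `Cloneable` and `java.io.Serializable`, so a receiver known to implement any other interface can never be an array and the array guard can be folded.

```
import java.io.Serializable;

// Stand-alone illustration: every array type is a subtype of exactly Cloneable
// and java.io.Serializable, and of no other interface, so an object known to
// implement e.g. Comparable cannot be an array.
public class ArrayInterfaceDemo {
    public static void main(String[] args) {
        Object ints = new int[4];
        Object strs = new String[4];

        System.out.println(ints instanceof Cloneable);    // true
        System.out.println(ints instanceof Serializable); // true
        System.out.println(strs instanceof Cloneable);    // true
        System.out.println(strs instanceof Serializable); // true
        System.out.println(ints instanceof Comparable);   // false
        System.out.println(strs instanceof Comparable);   // false
    }
}
```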
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24245#discussion_r2020893305 From duke at openjdk.org Mon Mar 31 14:28:20 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:20 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: <_TOBoO4cMQpw4sgzIpNpQZ2w5wDgezKQZLe314DQ7zo=.813b81bf-ecc0-4f75-a0d6-fbb13dde594e@github.com> References: <_TOBoO4cMQpw4sgzIpNpQZ2w5wDgezKQZLe314DQ7zo=.813b81bf-ecc0-4f75-a0d6-fbb13dde594e@github.com> Message-ID: On Mon, 24 Mar 2025 15:16:20 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: >> >> - Further readability improvements. >> - Added asserts for array sizes > > I still need to have a look at the sha3 changes, but I think I am done with the most complex part of the review. This was a really interesting bit of code to review! @vpaprotsk , thanks a lot for the very thorough review! > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 270: > >> 268: } >> 269: >> 270: static void loadPerm(int destinationRegs[], Register perms, > > `replXmm`? i.e. this function is replicating (any) Xmm register, not just perm?.. Since I am only using it for permutation describers, I thought this way it is easier to follow what is happening. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 327: > >> 325: // >> 326: // >> 327: static address generate_dilithiumAlmostNtt_avx512(StubGenerator *stubgen, > > Similar comments as to `generate_dilithiumAlmostInverseNtt_avx512` > > - similar comment about the 'pair-wise' operation, updating `[j]` and `[j+l]` at a time.. > - somehow had less trouble following the flow through registers here, perhaps I am getting used to it. FYI, ended renaming some as: > > // xmm16_27 = Temp1 > // xmm0_3 = Coeffs1 > // xmm4_7 = Coeffs2 > // xmm8_11 = Coeffs3 > // xmm12_15 = Coeffs4 = Temp2 > // xmm16_27 = Scratch For me, it was easier to follow what goes where using the xmm... names (with the symbolic names you always have to remember which one overlaps with another and how much). > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 421: > >> 419: for (int i = 0; i < 8; i += 2) { >> 420: __ evpermi2d(xmm(i / 2 + 12), xmm(i), xmm(i + 1), Assembler::AVX_512bit); >> 421: } > > Wish there was a more 'abstract' way to arrange this, so its obvious from the shape of the code what registers are input/outputs (i.e. and use the register arrays). Even though its just 'elementary index operations' `i/2 + 16` is still 'clever'. Couldnt think of anything myself though (same elsewhere in this function for the table permutes). Well, this is how it is when we have three inputs, one of which also plays as output... At least the output is always the first one (so that one gets clobbered). This is why you have to replicate the permutation describer when you need both permutands later. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 509: > >> 507: // coeffs (int[256]) = c_rarg0 >> 508: // zetas (int[256]) = c_rarg1 >> 509: static address generate_dilithiumAlmostInverseNtt_avx512(StubGenerator *stubgen, > > Done with this function; Perhaps the 'permute table' is a common vector-algorithm pattern, but this is really clever! > > Some general comments first, rest inline. > > - The array names for registers helped a lot. And so did the new helper functions! > - The java version of this code is quite intimidating to vectorize.. 3D loop, with geometric iteration variables.. 
and the literature is even more intimidating (discrete convolutions which I havent touched in two decades, ffts, ntts, etc.) Here is my attempt at a comment to 'un-scare' the next reader, though feel free to reword however you like. > > The core of the (Java) loop is this 'pair-wise' operation: > int a = coeffs[j]; > int b = coeffs[j + offset]; > coeffs[j] = (a + b); > coeffs[j + offset] = montMul(a - b, -MONT_ZETAS_FOR_NTT[m]); > > There are 8 'levels' (0-7); ('levels' are equivalent to (unrolling) the outer (Java) loop) > At each level, the 'pair-wise-offset' doubles (2^l: 1, 2, 4, 8, 16, 32, 64, 128). > > To vectorize this Java code, observe that at each level, REGARDLESS the offset, half the operations are the SUM, and the other half is the > montgomery MULTIPLICATION (of the pair-difference with a constant). At each level, one 'just' has to shuffle > the coefficients, so that SUMs and MULTIPLICATIONs line up accordingly. > > Otherwise, this pattern is 'lightly similar' to a discrete convolution (compute integral/summation of two functions at every offset) > > - I still would prefer (more) symbolic register names.. I wouldn't hold my approval over it so won't object if nobody else does, but register numbers are harder to 'see' through the flow. I ended up search/replacing/'annotating' to make it easier on myself to follow the flow of data: > > // xmm8_11 = Perms1 > // xmm12_15 = Perms2 > // xmm16_27 = Scratch > // xmm0_3 = CoeffsPlus > // xmm4_7 = CoeffsMul > // xmm24_27 = CoeffsMinus (overlaps with Scratch) > > (I made a similar comment, but I think it is now hidden after the last refactor) > - would prefer to see the helper functions to get ALL the registers passed explicitly (i.e. currently `montMulPerm`, `montQInvModR`, `dilithium_q`, `xmm29`, are implicit.). As a general rule, I've tried to set up all the registers up at the 'entry' function (`generate_dilithium*` in this case) and ... I added some more comments, but I kept the xmm... names for the registers, just like with the ntt function. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 554: > >> 552: for (int i = 0; i < 8; i += 2) { >> 553: __ evpermi2d(xmm(i / 2 + 8), xmm(i), xmm(i + 1), Assembler::AVX_512bit); >> 554: __ evpermi2d(xmm(i / 2 + 12), xmm(i), xmm(i + 1), Assembler::AVX_512bit); > > Took a bit to unscramble the flow, so a comment needed? Purpose 'fairly obvious' once I got the general shape of the level/algorithm (as per my top-level comment) but something like "shuffle xmm0-7 into xmm8-15"? I hope the comment that I added at the beginning of the function sheds some light on the purpose of these permutations. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 656: > >> 654: for (int i = 0; i < 8; i++) { >> 655: __ evpsubd(xmm(i), k0, xmm(i + 8), xmm(i), false, Assembler::AVX_512bit); >> 656: } > > Fairly clean as is, but could also be two sub_add calls, I think (you have to swap order of add/sub in the helper, to be able to clobber `xmm(i)`.. or swap register usage downstream, so perhaps not.. but would be cleaner) > > sub_add(CoeffsPlus, Scratch, Perms1, CoeffsPlus, _masm); > sub_add(CoeffsMul, &Scratch[4], Perms2, CoeffsMul, _masm); > > > If nothing else, would had prefered to see the use of the register array variables I would rather leave this alone, too. I was considering the same, but decided that this is fairly easy to follow, it would be more complicated to either add a new helper function or follow where there are overlaps in the symbolically named register sets. 
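To complement the 'pair-wise' description above, here is a scalar Java sketch of what a single such level does. Illustration only: the method names and zeta layout are hypothetical, the reduction step is a plain modular multiplication standing in for the real Montgomery multiplication in ML_DSA.java, and values are assumed small enough that the int additions do not overflow.

```
// Scalar sketch of one "level" of the pair-wise operation: for a given offset,
// half of the updates are sums and the other half are (reduced) products of
// the pair difference with a per-group constant.
public class NttLevelSketch {
    static final int Q = 8380417; // the ML-DSA modulus, for illustration

    static int mulReduce(long a, long b) {
        return (int) Math.floorMod(a * b, Q); // stand-in, NOT Montgomery reduction
    }

    // zetas holds one constant per group of 'offset' pairs (hypothetical layout)
    static void nttLevel(int[] coeffs, int[] zetas, int offset) {
        int m = 0;
        for (int start = 0; start < coeffs.length; start += 2 * offset, m++) {
            for (int j = start; j < start + offset; j++) {
                int a = coeffs[j];
                int b = coeffs[j + offset];
                coeffs[j] = a + b;                               // "sum" half
                coeffs[j + offset] = mulReduce(a - b, zetas[m]); // "multiply" half
            }
        }
    }
}
```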
> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 871: > >> 869: __ evpaddd(xmm5, k0, xmm1, barrettAddend, false, Assembler::AVX_512bit); >> 870: __ evpaddd(xmm6, k0, xmm2, barrettAddend, false, Assembler::AVX_512bit); >> 871: __ evpaddd(xmm7, k0, xmm3, barrettAddend, false, Assembler::AVX_512bit); > Fairly 'straightforward' transcription of the Java code.. no comments from me. > > At first glance using `xmm0_3`, `xmm4_7`, etc. might have been a good idea, but you only save one line per 4x group. (Unless you have one big loop, but I suspect that gives you worse performance? Is that something you tried already? Might be worth it otherwise..) I have considered this but decided to leave it alone (for the reason that you mentioned). > src/java.base/share/classes/sun/security/provider/ML_DSA.java line 1418: > >> 1416: int twoGamma2, int multiplier) { >> 1417: assert (input.length == ML_DSA_N) && (lowPart.length == ML_DSA_N)
------------- PR Comment: https://git.openjdk.org/jdk/pull/23860#issuecomment-2766414076 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150966 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021151152 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021151361 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021151680 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021152095 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021152962 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021154571 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021156249 From duke at openjdk.org Mon Mar 31 14:28:21 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:21 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Mon, 17 Mar 2025 19:22:41 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Made the intrinsics test separate from the pure java test. > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 45: > >> 43: // Constants >> 44: // >> 45: ATTRIBUTE_ALIGNED(64) static const uint32_t dilithiumAvx512Consts[] = { > > This is really nitpicking.. but could had loaded constants inline with `movl` without requiring an ExternalAddress()? > > Nice to have constants together, only complaint is we have 'magic offsets' in ASM to reach in for particular one.. > > This one isnt too bad, offset of 32bits is easy to inspect visually (`dilithiumAvx512ConstsAddr()` could take a parameter perhaps) I added symbolic names for the indexes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021149647 From duke at openjdk.org Mon Mar 31 14:28:22 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:22 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: On Sun, 23 Mar 2025 00:21:18 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 119: >> >>> 117: static address dilithiumAvx512PermsAddr() { >>> 118: return (address) dilithiumAvx512Perms; >>> 119: } >> >> Hear me out.. ... >> enums!! >> >> enum nttPermOffset { >> montMulPermsIdx = 0, >> nttL4PermsIdx = 64, >> nttL5PermsIdx = 192, >> nttL6PermsIdx = 320, >> nttL7PermsIdx = 448, >> nttInvL0PermsIdx = 704, >> nttInvL1PermsIdx = 832, >> nttInvL2PermsIdx = 960, >> nttInvL3PermsIdx = 1088, >> nttInvL4PermsIdx = 1216, >> }; >> static address dilithiumAvx512PermsAddr(nttPermOffset offset) { >> return (address) dilithiumAvx512Perms + offset; >> } > > belay that comment.. now that I looked at `generate_dilithiumAlmostInverseNtt_avx512`, I see why thats not the 'entire picture'.. I leave it as it is now. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021149925 From duke at openjdk.org Mon Mar 31 14:28:24 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:24 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v10] In-Reply-To: <2yP2P1VNWgQu6cWvn0_a_7LdidS71C6PWKcqGKTOHnc=.49f8ac0f-df23-4f1e-adb9-e03a3f2295b2@github.com> References: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> <2yP2P1VNWgQu6cWvn0_a_7LdidS71C6PWKcqGKTOHnc=.49f8ac0f-df23-4f1e-adb9-e03a3f2295b2@github.com> Message-ID: On Sat, 22 Mar 2025 16:36:08 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix windows build > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 121: > >> 119: static void montmulEven(int outputReg, int inputReg1, int inputReg2, >> 120: int scratchReg1, int scratchReg2, >> 121: int parCnt, MacroAssembler *_masm) { > > nitpick.. this could be made to look more like `montMul64()` by also taking in an array of registers. I eliminated this function. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 160: > >> 158: for (int i = 0; i < 4; i++) { >> 159: __ vpmuldq(xmm(scratchRegs[i]), xmm(inputRegs1[i]), xmm(inputRegs2[i]), >> 160: Assembler::AVX_512bit); > > using an array of registers, instead of array of ints would read somewhat more compact and fewer 'indirections' . i.e. > > static void montMul64(XMMRegister outputRegs*, XMMRegister inputRegs1*, XMMRegister inputRegs2*, > ... > __ vpmuldq(scratchRegs[i], inputRegs1[i], inputRegs2[i], Assembler::AVX_512bit); I think from the names it is easy enough to see that we are really passing register names here and it is also easy to check that the indexes of the registers in the named arrays are really what the names of those arrays suggest, so I would like to leave this alone. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 645: > >> 643: // poly1 (int[256]) = c_rarg1 >> 644: // poly2 (int[256]) = c_rarg2 >> 645: static address generate_dilithiumNttMult_avx512(StubGenerator *stubgen, > > This would be 'nice to have', something 'lost' with the refactor.. > > As I was reviewing this (original) function, I was thinking, "there is nothing here _that_ specific to AVX512, mostly columnar&independent operations... This function could be made 'vector-length-independent'..." > - double the loop length: > > int iter = vector_len==Assembler::AVX_512bit?4:8; > __ movl(len, 4); -> __ movl(len, iter); > > - halve the register arrays.. (or keep them the same but shuffle them to make SURE the first half are in xmm0-xmm15 range) > > XMMRegister POLY1[] = {xmm0, xmm1, xmm12, xmm13}; > XMMRegister POLY2[] = {xmm4, xmm5, xmm16, xmm17}; > XMMRegister SCRATCH1[] = {xmm2, xmm3, xmm14, xmm15}; <<< here > XMMRegister SCRATCH2[] = {xmm6, xmm7, xmm18, xmm19}; <<< and here > XMMRegister SCRATCH3[] = {xmm8, xmm9, xmm10, xmm11}; > > - couple of other int constants (like the memory 'step' and such) > - for assembler calls, like `evmovdqul` and `evpsubd`, need a few small new MacroAssembler helpers to instead generate VEX encoded versions (plenty of instructions already do that). > - I think only the perm instruction was unique to evex (didnt really think of an alternative for AVX2.. but can be abstracted away with another helper) > > Anyway; not suggesting its something you do here.. 
but it would be convenient to leave breadcrumbs/hooks for a future update so one of us can revisit this code and add AVX2 support. e.g. `parCnt` variable was very convenient before for exactly this, now its gone... it probably could be derived in each function from vector_len but..; Its now cleaner, but also harder to 'upgrade'? > > Why AVX2? many of the newer (Atom/Ecore-based/EnableX86ECoreOpts) processors do not have AVX512 support, so its something I've been prioritizing recently > > The alternative would be to write a completely separate AVX2 implementation, but that would be a shame, not to 'just' reuse this code. > ? > "For fun", I had even gone and parametrized the mult function with the `vector_len` to see how it would look (almost identical... to the original version): > > static void montmulEven2(XMMRegister* outputReg, XMMRegister* inputReg1, XMMRegister* inputReg2, XMMRegister* scratchReg1, > XMMRegister* scratchReg2, XMMRegister montQInvModR, XMMRegister dilithium_q, int parCnt, int vector_len, ... I'd like to leave this for another PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150150 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150516 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021153931 From duke at openjdk.org Mon Mar 31 14:28:25 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:25 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v10] In-Reply-To: <36fyT0z29o9GYLeQhpYkIT4d2By-8z7TEU8TGtT2uHI=.50647fa4-32ca-41ef-8287-075a70254143@github.com> References: <2N5Evij0f6qZi_pG3tqoz11aQbSnLG0YszqHR9ROfKI=.d44b16c6-d334-42c4-8de8-92eb41229248@github.com> <2yP2P1VNWgQu6cWvn0_a_7LdidS71C6PWKcqGKTOHnc=.49f8ac0f-df23-4f1e-adb9-e03a3f2295b2@github.com> <36fyT0z29o9GYLeQhpYkIT4d2By-8z7TEU8TGtT2uHI=.50647fa4-32ca-41ef-8287-075a70254143@github.com> Message-ID: On Sun, 23 Mar 2025 00:26:20 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 216: >> >>> 214: // Zmm8-Zmm23 used as scratch registers >>> 215: // result goes to Zmm0-Zmm7 >>> 216: static void montMulByConst128(MacroAssembler *_masm) { >> >> wish the inputs and output register arrays were explicit.. easier to follow that way > > Looking at this function some more.. I think you could remove this function and replace it with two calls to `montMul64`? > > montMul64(xmm0_3, xmm0_3, xmm29_29, Scratch*, _masm); > montMul64(xmm4_7, xmm4_7, xmm29_29, Scratch*, _masm); > ``` > Scratch would have to be defined.. I accepted this suggestion, it really saved quite a few lines of code, thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021150687 From duke at openjdk.org Mon Mar 31 14:28:26 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:26 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Sat, 22 Mar 2025 16:11:02 GMT, Volodymyr Paprotski wrote: >> These functions will not be used anywhere else and in ML_DSA.java all of the arrays passed to inrinsics are of the correct size. > > Works for me; just thought I would point it out, so its a 'premeditated' decision. Well, I ended up putting some asserts in the java code, just in case... 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021153417 From duke at openjdk.org Mon Mar 31 14:28:27 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:28:27 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v5] In-Reply-To: References: <3bphXKLpIpxAZP-FEOeob6AaHbv0BAoEceJka64vMW8=.3e4f74e0-9479-4926-b365-b08d8d702692@github.com> Message-ID: On Thu, 6 Mar 2025 19:26:14 GMT, Volodymyr Paprotski wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Accepted review comments. > > src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 409: > >> 407: __ evmovdquq(xmm29, Address(permsAndRots, 768), Assembler::AVX_512bit); >> 408: __ evmovdquq(xmm30, Address(permsAndRots, 832), Assembler::AVX_512bit); >> 409: __ evmovdquq(xmm31, Address(permsAndRots, 896), Assembler::AVX_512bit); > > Matter of taste, but I liked the compactness of montmulEven; i.e. > > for (i=0; i<15; i++) > __ evmovdquq(xmm(17+i), Address(permsAndRots, 64*i), Assembler::AVX_512bit); Changed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021155416 From duke at openjdk.org Mon Mar 31 14:40:56 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 31 Mar 2025 14:40:56 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v12] In-Reply-To: References: Message-ID: > By using the AVX-512 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Reacting to comments by Volodymyr. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23860/files - new: https://git.openjdk.org/jdk/pull/23860/files/56656894..7a9f6645 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23860&range=10-11 Stats: 145 lines in 2 files changed: 24 ins; 91 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/23860.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23860/head:pull/23860 PR: https://git.openjdk.org/jdk/pull/23860 From jbhateja at openjdk.org Mon Mar 31 16:43:39 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 31 Mar 2025 16:43:39 GMT Subject: RFR: 8351034: Add AVX-512 intrinsics for ML-DSA [v11] In-Reply-To: References: Message-ID: <-sFpKarpt9CP7DYd7v9vSBAgHYthQ4OZFNGHFOgb2AI=.fc908719-8e45-43d2-97df-95ff01129275@github.com> On Mon, 31 Mar 2025 11:11:54 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/x86/vm_version_x86.cpp line 1252: >> >>> 1250: // Currently we only have them for AVX512 >>> 1251: #ifdef _LP64 >>> 1252: if (supports_evex() && supports_avx512bw()) { >> >> supports_evex check looks redundant. > > These are checks for two different feature bits: CPU_AVX512F and CPU_AVX512BW. Are you saying that the latter implies the former in every implementation of the spec? AVX512BW is built on top of AVX512F spec. In assembler and other places we only check BW in assertions which implies EVEX. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2021381288