From doug.simon at oracle.com Sat Feb 1 08:03:35 2025 From: doug.simon at oracle.com (Douglas Simon) Date: Sat, 1 Feb 2025 08:03:35 +0000 Subject: Proposal: Remove EnableJVMCI flag Message-ID: Hi, https://bugs.openjdk.org/browse/JDK-8345826 was filed to make libgraal and new CDS optimizations more compatible: Since JDK 483, many more CDS optimizations are enabled when -XX:+AOTClassLinking is specified (see numbers in https://bugs.openjdk.org/browse/JDK-8342279). However, these optimizations require the archived module graph to be used. Today, if you enable UseGraalJIT, the archived module graph will be disabled. As a result, the *entire* CDS archive will be disabled. This will result in slower start-up time when UseGraalJIT is enabled. Further internal discussion resulted in the proposal to remove all use of EnableJVMCI in the VM code. This will mean -XX:+EnableJVMCI only applies to the Java code (i.e. adds jdk.internal.vm.ci to the root module set). However, further reflection suggests something more aggressive is worth considering: remove the EnableJVMCI flag altogether. This option was implemented to make use of JVMCI opt-in. However, JVMCI is effectively opt-in anyway without this option. There are two ways in which JVMCI can be used: as a JIT compiler by the CompileBroker and as a compiler for "guest" code (e.g., Truffle use case). 1. JVMCI as JIT. To enable JVMCI as JIT, flags such as UseJVMCICompiler, UseGraalJIT or EnableJVMCIProduct must be specified to the java launcher. Each of these flags sets EnableJVMCI to true as a side-effect. That is, use of JVMCI as JIT is already opt-in due to needing these other flags - specifying EnableJVMCI is redundant. 2. JVMCI as guest code compiler In this mode, the jdk.internal.vm.ci module must be loaded (i.e. EnableJVMCI currently has the side-effect of `--add-modules=jdk.internal.vm.ci`). 
This module has no unqualified exports (as seen in its module descriptor) so using it requires specifying at least one instance of --add-exports to the Java launcher. That is, once again EnableJVMCI alone is not sufficient for opting in to JVMCI. In light of the above, I propose removing EnableJVMCI altogether. This will require using --add-modules=jdk.internal.vm.ci when you actually want to use the JVMCI module. It will also require modifying JDK code guarded by this flag. It guards both VM code and use of the `jdk.internal.vm.ci` module and I consider them separately below. #### VM code All uses of EnableJVMCI to guard VM code would be adapted with one of the following strategies: 1. Remove the guard and make the code unconditional. 2. Replace EnableJVMCI with something else such as UseJVMCICompiler or a test of a global variable set to true as soon as JVMCI compiled code is about to be installed in the code cache (example). 3. Replace EnableJVMCI with a test of whether the jdk.internal.vm.ci module has been resolved (example). Of course, this change almost certainly needs a CSR as well but I'd like to get feedback on the primary change before worrying about that. -Doug -------------- next part -------------- An HTML attachment was scrubbed... 
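[Editor's note] Strategy 3 above (testing whether the jdk.internal.vm.ci module has been resolved) has a straightforward Java-level analogue; a minimal sketch using only the java.base ModuleLayer API (the class name JvmciModuleCheck is illustrative, not from the proposal):

```java
public class JvmciModuleCheck {
    public static void main(String[] args) {
        // jdk.internal.vm.ci appears in the boot layer only when it was
        // added to the root module set, e.g. with
        //   java --add-modules=jdk.internal.vm.ci JvmciModuleCheck
        boolean resolved = ModuleLayer.boot()
                .findModule("jdk.internal.vm.ci")
                .isPresent();
        System.out.println("jdk.internal.vm.ci resolved: " + resolved);
    }
}
```

Run with no extra flags this should print false on a stock JDK, since the module has no unqualified exports and is therefore not a default root module; with --add-modules=jdk.internal.vm.ci it should print true.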
URL: From galder at openjdk.org Mon Feb 3 14:22:52 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Mon, 3 Feb 2025 14:22:52 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Fri, 17 Jan 2025 17:53:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. 
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo @eastig fyi ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2631136070 From duke at openjdk.org Mon Feb 3 15:49:01 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Feb 2025 15:49:01 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v3] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: - merging master - Use SHA3Parallel for matrix generation - fixing whitespace errors - 8348561: Add aarch64 intrinsics for ML-DSA ------------- Changes: https://git.openjdk.org/jdk/pull/23300/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=02 Stats: 2133 lines in 19 files changed: 2045 ins; 11 del; 77 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From duke at openjdk.org Mon Feb 3 16:15:32 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Feb 2025 16:15:32 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: removed debugging code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23300/files - new: https://git.openjdk.org/jdk/pull/23300/files/5630fd14..9f7c4a23 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=02-03 Stats: 25 lines in 3 files changed: 0 ins; 25 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From vladimir.kozlov at oracle.com Mon Feb 3 17:45:39 2025 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 3 Feb 2025 09:45:39 -0800 Subject: Proposal: Remove EnableJVMCI flag In-Reply-To: References: Message-ID: <611affa4-09c6-41af-a853-1106e12dfbb9@oracle.com> Hi Doug, My concern is that some code (stubs, blobs, Interpreter) is generated before we load any modules. 
How do you handle JVMCI-specific code there if you have it? If you don't have such code then we can discuss. I am definitely against adding runtime checks for JVMCI presence into executed (assembler) code. It would be nice if, when the command line is parsed, we could detect the presence of the `--add-modules=jdk.internal.vm.ci` (or other related) flag and enable the JVMCI flag. I am fine with keeping `EnableJVMCI` but making it ergonomic. You may still want to disable JVMCI from the command line even if somewhere in a start script you have `--add-modules=jdk.internal.vm.ci`. Thanks, Vladimir K On 2/1/25 12:03 AM, Douglas Simon wrote: > Hi, > > https://bugs.openjdk.org/browse/JDK-8345826 was filed to make libgraal and > new CDS optimizations more compatible: > >> Since JDK 483, many more CDS optimizations are enabled when -XX:+AOTClassLinking is specified (see numbers in https://bugs.openjdk.org/browse/JDK-8342279). However, these optimizations require the archived module graph to be used. >> Today, if you enable UseGraalJIT, the archived module graph will be disabled. As a result, the *entire* CDS archive >> will be disabled. This will result in slower start-up time when UseGraalJIT is enabled. >> > > Further internal discussion resulted in > the proposal to remove all use of EnableJVMCI in the VM code. This will mean -XX:+EnableJVMCI only applies to the Java > code (i.e. adds jdk.internal.vm.ci to the root module set). > > However, further reflection suggests something more aggressive is worth considering: remove the EnableJVMCI flag altogether. > > This option was implemented to make use of JVMCI opt-in. However, JVMCI is effectively opt-in anyway without this > option. There are two ways in which JVMCI can be used: as a JIT compiler by the CompileBroker and as a compiler for > "guest" code (e.g., Truffle use case). > > 1. JVMCI as JIT. 
> > To enable JVMCI as JIT, flags such as UseJVMCICompiler, UseGraalJIT or EnableJVMCIProduct must be specified to the java > launcher. Each of these flags sets EnableJVMCI to true as a side-effect. That is, use of JVMCI as JIT is already opt-in > due to needing these other flags - specifying EnableJVMCI is redundant. > > 2. JVMCI as guest code compiler > > In this mode, the jdk.internal.vm.ci module must be loaded (i.e. EnableJVMCI currently has the side-effect of `--add-modules=jdk.internal.vm.ci`). This module has no unqualified exports (as seen in its module descriptor: https://github.com/openjdk/jdk/blob/master/src/jdk.internal.vm.ci/share/classes/module-info.java) so using it requires > specifying at least one instance of --add-exports to the Java launcher. That is, once again EnableJVMCI alone is not > sufficient for opting in to JVMCI. > > In light of the above, I propose removing EnableJVMCI altogether. This will require using --add-modules=jdk.internal.vm.ci when you actually want to use the JVMCI module. It will also require modifying JDK code > guarded by this flag. It guards both VM code and use of the `jdk.internal.vm.ci` module and I consider them separately > below. > > #### VM code > > All uses of EnableJVMCI to guard VM code would be adapted with one of the following strategies: > 1. Remove the guard and make the code unconditional. > 2. Replace EnableJVMCI with something else such as UseJVMCICompiler or a test of a global variable set to true as soon as > JVMCI compiled code is about to be installed in the code cache (example). > 3. Replace EnableJVMCI with a test of whether the jdk.internal.vm.ci module has been resolved (example: https://github.com/openjdk/jdk/pull/23408/files#diff-4e6668d768f7d67417cbac39bcb723552cc0b80ad218709cfa0e6e31f32b69f0R518). 
> > Of course, this change almost certainly needs a CSR as well but I'd like to get feedback on the primary change before > worrying about that. > > -Doug > From coleenp at openjdk.org Mon Feb 3 17:49:19 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:19 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native Message-ID: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. Tested with tier1-8. ------------- Commit messages: - Removed @Stable. - Fix JFR bug. - 8345678: Make Class.getModifiers() non-native. 
Changes: https://git.openjdk.org/jdk/pull/22652/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346567 Stats: 218 lines in 34 files changed: 57 ins; 139 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From liach at openjdk.org Mon Feb 3 17:49:20 2025 From: liach at openjdk.org (Chen Liang) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <1kHVpYCOExfkn8UHTZNZT6zwjRj3MCXJD2LVcY0NTrg=.0644323b-5f40-4441-8c19-763105aaf08d@github.com> On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. The change to java.lang.Class looks good. Looking at #23396, we might need to filter this field too. 
src/hotspot/share/classfile/javaClasses.cpp line 1504: > 1502: macro(_reflectionData_offset, k, "reflectionData", java_lang_ref_SoftReference_signature, false); \ > 1503: macro(_signers_offset, k, "signers", object_array_signature, false); \ > 1504: macro(_modifiers_offset, k, vmSymbols::modifiers_name(), int_signature, false) Do we need a trailing semicolon here? src/java.base/share/classes/java/lang/Class.java line 1315: > 1313: > 1314: // Set by the JVM when creating the instance of this java.lang.Class > 1315: private transient int modifiers; If this is set by the JVM, can this be marked `final` so JIT compiler can trust this field? Also preferable if we can move this together with components/signers/classData fields. ------------- PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2490110846 PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2631658029 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1876630297 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1876627105 From coleenp at openjdk.org Mon Feb 3 17:49:20 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. 
I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. > Looking at https://github.com/openjdk/jdk/pull/23396, we might need to filter this field too. Yes, I agree. This patch is a follow on to that one, so I'll add it to the same places when that one is merged in here. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2631661716 From coleenp at openjdk.org Mon Feb 3 17:49:20 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <1kHVpYCOExfkn8UHTZNZT6zwjRj3MCXJD2LVcY0NTrg=.0644323b-5f40-4441-8c19-763105aaf08d@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <1kHVpYCOExfkn8UHTZNZT6zwjRj3MCXJD2LVcY0NTrg=.0644323b-5f40-4441-8c19-763105aaf08d@github.com> Message-ID: On Mon, 9 Dec 2024 19:46:43 GMT, Chen Liang wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. 
One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > src/hotspot/share/classfile/javaClasses.cpp line 1504: > >> 1502: macro(_reflectionData_offset, k, "reflectionData", java_lang_ref_SoftReference_signature, false); \ >> 1503: macro(_signers_offset, k, "signers", object_array_signature, false); \ >> 1504: macro(_modifiers_offset, k, vmSymbols::modifiers_name(), int_signature, false) > > Do we need a trailing semicolon here? yes. it is needed. > src/java.base/share/classes/java/lang/Class.java line 1315: > >> 1313: >> 1314: // Set by the JVM when creating the instance of this java.lang.Class >> 1315: private transient int modifiers; > > If this is set by the JVM, can this be marked `final` so JIT compiler can trust this field? Also preferable if we can move this together with components/signers/classData fields. The JVM rearranges these fields so that's why I put it near the caller. Let me check if final compiles. Edit: it looks better with the other fields though. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1876712191 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1876713323 From duke at openjdk.org Mon Feb 3 17:49:20 2025 From: duke at openjdk.org (ExE Boss) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <1kHVpYCOExfkn8UHTZNZT6zwjRj3MCXJD2LVcY0NTrg=.0644323b-5f40-4441-8c19-763105aaf08d@github.com> Message-ID: On Mon, 9 Dec 2024 20:27:52 GMT, Coleen Phillimore wrote: >> src/hotspot/share/classfile/javaClasses.cpp line 1504: >> >>> 1502: macro(_reflectionData_offset, k, "reflectionData", java_lang_ref_SoftReference_signature, false); \ >>> 1503: macro(_signers_offset, k, "signers", object_array_signature, false); \ >>> 1504: macro(_modifiers_offset, k, vmSymbols::modifiers_name(), int_signature, false) >> >> Do we need a trailing semicolon here? > > yes. it is needed. This is **C++**, so yes. 
> Suggestion: > > macro(_modifiers_offset, k, vmSymbols::modifiers_name(), int_signature, false); I see, there's a trailing semi somewhere in the expansion of this macro so it compiles, but I added one in. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1878263513 From heidinga at openjdk.org Mon Feb 3 17:49:20 2025 From: heidinga at openjdk.org (Dan Heidinga) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <5bMxhTRPqj-dMhr3FoSrym2ttWuzjWwtXAEcQHbF9Vg=.859ae29f-2530-4130-b108-d47c100ac19f@github.com> On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. 
src/java.base/share/classes/java/lang/Class.java line 244: > 242: classLoader = loader; > 243: componentType = arrayComponentType; > 244: modifiers = 0; The comment above about assigning a parameter to the field to prevent the JIT from assuming an incorrect default also should apply to the new `modifiers` field. I think the constructor, which is never called, should also pass in a `dummyModifiers` value rather than using 0 directly ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880689835 From coleenp at openjdk.org Mon Feb 3 17:49:20 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <5bMxhTRPqj-dMhr3FoSrym2ttWuzjWwtXAEcQHbF9Vg=.859ae29f-2530-4130-b108-d47c100ac19f@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <5bMxhTRPqj-dMhr3FoSrym2ttWuzjWwtXAEcQHbF9Vg=.859ae29f-2530-4130-b108-d47c100ac19f@github.com> Message-ID: On Wed, 11 Dec 2024 18:15:57 GMT, Dan Heidinga wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. 
> > src/java.base/share/classes/java/lang/Class.java line 244: > >> 242: classLoader = loader; >> 243: componentType = arrayComponentType; >> 244: modifiers = 0; > > The comment above about assigning a parameter to the field to prevent the JIT from assuming an incorrect default also should apply to the new `modifiers` field. I think the constructor, which is never called, should also pass in a `dummyModifiers` value rather than using 0 directly Yes, definitely, didn't see that this is the right way to do this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1887157349 From duke at openjdk.org Mon Feb 3 17:49:21 2025 From: duke at openjdk.org (ExE Boss) Date: Mon, 3 Feb 2025 17:49:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <-GEiPPAhFzy-uaUwIACYA7fZVCT3wkuVd-gtf9rrlnw=.de130f97-59bd-4581-a568-05d6238cf90a@github.com> On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. 
> > Tested with tier1-8. src/java.base/share/classes/java/lang/Class.java line 1005: > 1003: private transient Object[] signers; // Read by VM, mutable > 1004: > 1005: @Stable The `modifiers` field doesn't need to be `@Stable`: Suggestion: test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 65: > 63: */ > 64: @Benchmark > 65: public int getModifiers() throws NoSuchMethodException { The only `Throwable`s that can be thrown by calling `Class::getModifiers()` are `Error`s (e.g.: `StackOverflowError`) and `RuntimeException`s (e.g.: `NullPointerException`): Suggestion: public int getModifiers() { test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 71: > 69: Clazz[] clazzArray = new Clazz[1]; > 70: @Benchmark > 71: public int getAppArrayModifiers() throws NoSuchMethodException { Suggestion: public int getAppArrayModifiers() { test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 81: > 79: */ > 80: @Benchmark > 81: public int getArrayModifiers() throws NoSuchMethodException { Suggestion: public int getArrayModifiers() { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1888757754 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1888760732 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1888760967 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1888761412 From coleenp at openjdk.org Mon Feb 3 17:49:21 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <-GEiPPAhFzy-uaUwIACYA7fZVCT3wkuVd-gtf9rrlnw=.de130f97-59bd-4581-a568-05d6238cf90a@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <-GEiPPAhFzy-uaUwIACYA7fZVCT3wkuVd-gtf9rrlnw=.de130f97-59bd-4581-a568-05d6238cf90a@github.com> Message-ID: On Tue, 17 Dec 2024 15:54:48 GMT, ExE Boss wrote: >> The Class.getModifiers() 
method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > src/java.base/share/classes/java/lang/Class.java line 1005: > >> 1003: private transient Object[] signers; // Read by VM, mutable >> 1004: >> 1005: @Stable > > The?`modifiers`?field doesn?t?need to?be?`@Stable`: > Suggestion: I now don't know whether we want @Stable here or not. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1890866329 From vklang at openjdk.org Mon Feb 3 17:49:21 2025 From: vklang at openjdk.org (Viktor Klang) Date: Mon, 3 Feb 2025 17:49:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. 
This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. src/java.base/share/classes/java/lang/Class.java line 1006: > 1004: private final transient int modifiers; // Set by the VM > 1005: > 1006: // package-private @coleenp Could this field be @Stable, or does that only apply to `putfield`s? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1879797327 From liach at openjdk.org Mon Feb 3 17:49:21 2025 From: liach at openjdk.org (Chen Liang) Date: Mon, 3 Feb 2025 17:49:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Wed, 11 Dec 2024 10:24:03 GMT, Viktor Klang wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. 
I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > src/java.base/share/classes/java/lang/Class.java line 1006: > >> 1004: private final transient int modifiers; // Set by the VM >> 1005: >> 1006: // package-private > > @coleenp Could this field be @Stable, or does that only apply to `putfield`s? I don't think this needs to be stable - finals in java.lang is trusted by the JIT compiler. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880350790 From vklang at openjdk.org Mon Feb 3 17:49:24 2025 From: vklang at openjdk.org (Viktor Klang) Date: Mon, 3 Feb 2025 17:49:24 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Wed, 11 Dec 2024 14:52:48 GMT, Chen Liang wrote: >> src/java.base/share/classes/java/lang/Class.java line 1006: >> >>> 1004: private final transient int modifiers; // Set by the VM >>> 1005: >>> 1006: // package-private >> >> @coleenp Could this field be @Stable, or does that only apply to `putfield`s? > > I don't think this needs to be stable - finals in java.lang is trusted by the JIT compiler. Yeah, I was just thinking whether something set from inside the VM which is marked @Stable is constant-folded :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880374750 From coleenp at openjdk.org Mon Feb 3 17:49:24 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:24 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Wed, 11 Dec 2024 15:06:54 GMT, Viktor Klang wrote: >> I don't think this needs to be stable - finals in java.lang is trusted by the JIT compiler. 
> > Yeah, I was just thinking whether something set from inside the VM which is marked @Stable is constant-folded :) I don't think @Stable would hurt but final should provide the same guarantee. It's set internally by the VM so there's no late setting. I don't know if this field implementation can constant fold in the case of Arrays which are (JVM_ACC_ABSTRACT | JVM_ACC_FINAL | JVM_ACC_PUBLIC). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880663099 From heidinga at openjdk.org Mon Feb 3 17:49:25 2025 From: heidinga at openjdk.org (Dan Heidinga) Date: Mon, 3 Feb 2025 17:49:25 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Wed, 11 Dec 2024 15:06:54 GMT, Viktor Klang wrote: >> I don't think this needs to be stable - finals in java.lang is trusted by the JIT compiler. > > Yeah, I was just thinking whether something set from inside the VM which is marked @Stable is constant-folded :) @viktorklang-ora `@Stable` is not about how the field was set, but about the JIT observing a non-default value at compile time. If it observes a non-default value, it can treat it as a compile time constant. 
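The semantics Dan describes can be shown with a minimal, plain-Java sketch. The real `@Stable` annotation lives in `jdk.internal.vm.annotation`, is JDK-internal, and is only trusted on privileged classes, so the field below carries it only as a comment; the class and method names are hypothetical.

```java
// Illustration of the @Stable protocol: the default value (0) means
// "not yet computed"; once the JIT observes a non-default value it may
// treat the field as a compile-time constant. The real annotation is
// JDK-internal, so it appears here only as a comment.
class CachedModifiers {
    /* @Stable */ private int modifiers; // 0 = not yet initialized

    int get() {
        int m = modifiers;
        if (m == 0) {                    // still the default value
            m = computeModifiers();      // stand-in for what the VM computes
            modifiers = m;               // single transition: default -> settled value
        }
        return m;
    }

    private int computeModifiers() {
        return java.lang.reflect.Modifier.PUBLIC | java.lang.reflect.Modifier.FINAL;
    }
}
```

Note the caveat from the thread: a plain `final` field in `java.lang` is already trusted by the JIT, which is why the annotation on `modifiers` was questioned in the first place.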
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880692608 From vklang at openjdk.org Mon Feb 3 17:49:25 2025 From: vklang at openjdk.org (Viktor Klang) Date: Mon, 3 Feb 2025 17:49:25 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> On Wed, 11 Dec 2024 18:17:43 GMT, Dan Heidinga wrote: >> Yeah, I was just thinking whether something set from inside the VM which is marked @Stable is constant-folded :) > > @viktorklang-ora `@Stable` is not about how the field was set, but about the JIT observing a non-default value at compile time. If it observes a non-default value, it can treat it as a compile time constant. @DanHeidinga Great explanation, thank you! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1881782322 From duke at openjdk.org Mon Feb 3 18:14:54 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Feb 2025 18:14:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Thu, 30 Jan 2025 16:23:56 GMT, Andrew Dinn wrote: > @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2631720583 From jbhateja at openjdk.org Mon Feb 3 18:14:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 3 Feb 2025 18:14:56 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 11:03:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java > > Co-authored-by: Emanuel Peter Hi @PaulSandoz , @eme64 , All outstanding comments have been addressed, please let me know if there are other comments. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2631719276 From doug.simon at oracle.com Mon Feb 3 19:09:26 2025 From: doug.simon at oracle.com (Douglas Simon) Date: Mon, 3 Feb 2025 19:09:26 +0000 Subject: Proposal: Remove EnableJVMCI flag In-Reply-To: <611affa4-09c6-41af-a853-1106e12dfbb9@oracle.com> References: <611affa4-09c6-41af-a853-1106e12dfbb9@oracle.com> Message-ID: Hi Vladimir, On 3 Feb 2025, at 18:45, Vladimir Kozlov wrote: Hi Doug, My concern is that some code (stubs, blobs, Interpreter) are generated before we are loading any modules. How do you handle JVMCI-specific code there if you have it? If you don't have such code then we can discuss. You mean what would we do with generated code that currently tests EnableJVMCI? We have these 2 options as far as I can see: 1. Always generate the JVMCI part of the code (example). 2. Instead of testing EnableJVMCI, we instead test a JVMCI::_is_enabled bool which would be initialized during argument parsing (i.e. before any code is generated). JVMCI::_is_enabled would be set to true if jdk.internal.vm.ci is in the root module set or if any other JVMCI flags such as UseGraalJIT or UseJVMCICompiler are true. I suspect this option is the one to go with as it's pretty much equivalent to the current semantics (i.e. JVMCI-conditional VM code is only executed/generated if JVMCI is enabled). I am definitely against adding runtime checks for JVMCI presence into executed (assembler) code. I agree that we do not want that.
Would be nice if/when command line is parsed we can detect presence of `--add-modules=jdk.internal.vm.ci` (or others related) flag and enable JVMCI flag. I am fine to keep `EnableJVMCI` but make it ergonomic. I'd like EnableJVMCI to become purely an alias for --add-modules=jdk.internal.vm.ci. You may still want to disable JVMCI from command line even if somewhere in start script you have `--add-modules=jdk.internal.vm.ci`. I don't think we need to support such a contradiction - if the launcher has been asked to load jdk.internal.vm.ci as part of the root module set, then it wants JVMCI enabled. Either that or we make -EnableJVMCI undo any preceding --add-modules=jdk.internal.vm.ci (if that's even possible). -Doug On 2/1/25 12:03 AM, Douglas Simon wrote: Hi, https://bugs.openjdk.org/browse/JDK-8345826 was filed to make libgraal and new CDS optimizations more compatible: Since JDK 483, many more CDS optimizations are enabled when -XX:+AOTClassLinking is specified (see numbers in https://bugs.openjdk.org/browse/JDK-8342279). However, these optimizations require the archived module graph to be used. Today, if you enable UseGraalJIT, the archived module graph will be disabled. As a result, the *entire* CDS archive will be disabled. This will result in slower start-up time when UseGraalJIT is enabled. Further internal discussion resulted in the proposal to remove all use of EnableJVMCI in the VM code. This will mean -XX:+EnableJVMCI only applies to the Java code (i.e. adds jdk.internal.vm.ci to the root module set). However, further reflection suggests something more aggressive is worth considering: remove the EnableJVMCI flag altogether. This option was implemented to make use of JVMCI opt-in. However, JVMCI is effectively opt-in anyway without this option. There are two ways in which JVMCI can be used: as a JIT compiler by the CompileBroker and as a compiler for "guest" code (e.g., Truffle use case). 1. JVMCI as JIT.
To enable JVMCI as JIT, flags such as UseJVMCICompiler, UseGraalJIT or EnableJVMCIProduct must be specified to the java launcher. Each of these flags sets EnableJVMCI to true as a side-effect. That is, use of JVMCI as JIT is already opt-in due to needing these other flags - specifying EnableJVMCI is redundant. 2. JVMCI as guest code compiler In this mode, the jdk.internal.vm.ci module must be loaded (i.e. EnableJVMCI currently has the side-effect of `--add-modules=jdk.internal.vm.ci`). This module has no unqualified exports (as seen in its module descriptor ) so using it requires specifying at least one instance of --add-exports to the Java launcher. That is, once again EnableJVMCI alone is not sufficient for opting-in to JVMCI. In light of the above, I propose removing EnableJVMCI altogether. This will require using --add-modules=jdk.internal.vm.ci when you actually want to use the JVMCI module. It will also require modifying JDK code guarded by this flag. It guards both VM code and use of the `jdk.internal.vm.ci` module and I consider them separately below. #### VM code All uses of EnableJVMCI to guard VM code would be adapted with one of the following strategies: 1. Remove the guard and make the code unconditional. 2. Replace EnableJVMCI with something else such as UseJVMCICompiler or a test of a global variable set to true as soon as JVMCI compiled code is about to be installed in the code cache (example ). 3. Replace EnableJVMCI with a test of whether the jdk.internal.vm.ci module has been resolved (example ). Of course, this change almost certainly needs a CSR as well but I'd like to get feedback on the primary change before worrying about that. -Doug -------------- next part -------------- An HTML attachment was scrubbed...
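The "test whether the jdk.internal.vm.ci module has been resolved" strategy from the proposal above is observable from the Java side via the boot module layer. The snippet below is only an illustrative sketch with a hypothetical class name, not the HotSpot-side check the patch would actually use:

```java
// Checks whether jdk.internal.vm.ci was resolved into the boot layer,
// e.g. because the launcher was given --add-modules=jdk.internal.vm.ci.
public class JvmciModuleCheck {
    static boolean jvmciModulePresent() {
        return ModuleLayer.boot().findModule("jdk.internal.vm.ci").isPresent();
    }

    public static void main(String[] args) {
        System.out.println("jdk.internal.vm.ci resolved: " + jvmciModulePresent());
    }
}
```

Run with and without `--add-modules=jdk.internal.vm.ci` to see the result flip; on a default launch the module is not in the root module set.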
URL: From vladimir.kozlov at oracle.com Mon Feb 3 19:14:08 2025 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 3 Feb 2025 11:14:08 -0800 Subject: Proposal: Remove EnableJVMCI flag In-Reply-To: References: <611affa4-09c6-41af-a853-1106e12dfbb9@oracle.com> Message-ID: <888c013b-482e-4269-972e-078b8517485e@oracle.com> On 2/3/25 11:09 AM, Douglas Simon wrote: > Hi Vladimir, > >> On 3 Feb 2025, at 18:45, Vladimir Kozlov wrote: >> >> Hi Doug, >> >> My concern is that some code (stubs, blobs, Interpreter) are generated before we are loading any modules. >> How do you handle JVMCI-specific code there if you have it? If you don't have such code then we can discuss. > > You mean what would we do with generated code that currently tests EnableJVMCI? We have these 2 options as far as I can see: > 1. Always generate the JVMCI part of the code (example files#diff-524c9e019cb83916aa3db772fb33acbbe3e7465867a8d2f7e6376be3c8260eddL606>). > 2. Instead of testing EnableJVMCI, we instead test a JVMCI::_is_enabled bool which would be initialized during argument > parsing (i.e. before any code is generated). JVMCI::_is_enabled would be set to true if jdk.internal.vm.ci is in the > root module set or if any other JVMCI flags such as UseGraalJIT or UseJVMCICompiler are true. I suspect this option is > the one to go with as it's pretty much equivalent to the current semantics (i.e. JVMCI-conditional VM code is only executed/ > generated if JVMCI is enabled). I agree with option 2. This looks like the most reasonable approach. Thanks, Vladimir K > >> I am definitely against adding runtime checks for JVMCI presence into executed (assembler) code. > > I agree that we do not want that. >
> >> You may still want to disable JVMCI from command line even if somewhere in start script you have `--add- >> modules=jdk.internal.vm.ci`. > > I don't think we need to support such a contradiction - if the launcher has been asked to load jdk.internal.vm.ci as > part of the root module set, then it wants JVMCI enabled. Either that or we make -EnableJVMCI undo any preceding --add- > modules=jdk.internal.vm.ci (if that's even possible). > > -Doug > >> On 2/1/25 12:03 AM, Douglas Simon wrote: >>> Hi, >>> https://bugs.openjdk.org/browse/JDK-8345826 was filed to make libgraal >>> and new CDS optimizations more compatible: >>>> Since JDK 483, many more CDS optimizations are enabled when -XX:+AOTClassLinking is specified (see numbers >>>> in https:// bugs.openjdk.org/browse/JDK-8342279). However, these optimizations require the archived module graph to >>>> be used. Today, if you enable UseGraalJIT, the archived module graph will be disabled. As a result, the *entire* CDS >>>> archive will be disabled. This will result in slower start-up time when UseGraalJIT is enabled. >>>> >>> Further internal discussion >> focusedId=14736369&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14736369> resulted >>> in the proposal to remove all use of EnableJVMCI in the VM code. This will mean -XX:+EnableJVMCI only applies to the >>> Java code (i.e. adds jdk.internal.vm.ci to the root module set). >>> However, further reflection suggests something more aggressive is worth considering: remove the EnableJVMCI flag >>> altogether. >>> This option was implemented to make use of JVMCI opt-in. However, JVMCI is effectively opt-in anyway without this >>> option. There are two ways in which JVMCI can be used: as a JIT compiler by the CompileBroker and as a compiler for >>> "guest" code (e.g., Truffle use case). >>> 1. JVMCI as JIT.
>>> To enable JVMCI as JIT, flags such as UseJVMCICompiler, UseGraalJIT or EnableJVMCIProduct must be specified to the >>> java launcher. Each of these flags sets EnableJVMCI to true as a side-effect. That is, use of JVMCI as JIT is already >>> opt-in due to needing these other flags - specifying EnableJVMCI is redundant. >>> 2. JVMCI as guest code compiler >>> In this mode, the jdk.internal.vm.ci module must be loaded (i.e. EnableJVMCI currently has the side-effect of `--add- >>> modules=jdk.internal.vm.ci`). This module has no unqualified exports (as seen in its module descriptor >> github.com/openjdk/jdk/blob/master/src/jdk.internal.vm.ci/share/classes/module-info.java>) so using it requires >>> specifying at least one instance of --add-exports to the Java launcher. That is, once again EnableJVMCI alone is not >>> sufficient for opting-in to JVMCI. >>> In light of the above, I propose removing EnableJVMCI altogether. This will require using --add- >>> modules=jdk.internal.vm.ci when you actually want to use the JVMCI module. It will also require modifying JDK code >>> guarded by this flag. It guards both VM code and use of the `jdk.internal.vm.ci` module and I consider them >>> separately below. >>> #### VM code >>> All uses of EnableJVMCI to guard VM code would be adapted with one of the following strategies: >>> 1. Remove the guard and make the code unconditional. >>> 2. Replace EnableJVMCI with something else such as UseJVMCICompiler or a test of a global variable set to true as soon >>> as JVMCI compiled code is about to be installed in the code cache (example >> pull/23408/ files#diff-ee8337800ed1d1b84e3e49a2481809a6affac5d70ca23934a44497c9c758092fR456>). >>> 3. Replace EnableJVMCI with a test of whether the jdk.internal.vm.ci module has been resolved (example >> github.com/openjdk/jdk/pull/23408/files#diff-4e6668d768f7d67417cbac39bcb723552cc0b80ad218709cfa0e6e31f32b69f0R518>).
>>> Of course, this change almost certainly needs a CSR as well but I'd like to get feedback on the primary change before >>> worrying about that. >>> -Doug >> > From epeter at openjdk.org Tue Feb 4 08:52:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 08:52:15 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: <_5bwBRKG8Zu7iywOJZ6WgUb6N4so1sAO6Ua8S0zQU94=.3200ef74-4e50-424b-a3da-637be63e3f0c@github.com> On Mon, 3 Feb 2025 18:11:11 GMT, Jatin Bhateja wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java >> >> Co-authored-by: Emanuel Peter > > Hi @PaulSandoz , @eme64 , All outstanding comments have been addressed, please let me know if there are other comments. @jatin-bhateja Testing is all green :green_circle: Doing a last pass over the code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2633248273 From epeter at openjdk.org Tue Feb 4 09:03:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 09:03:17 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 11:03:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3.
Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java > > Co-authored-by: Emanuel Peter src/hotspot/share/opto/convertnode.hpp line 222: > 220: class ReinterpretS2HFNode : public Node { > 221: public: > 222: ReinterpretS2HFNode(Node* in1) : Node(0, in1) {} Suggestion: ReinterpretS2HFNode(Node* in1) : Node(nullptr, in1) {} Oh, just caught this. I think you should not use `0` here any more, check all other uses. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940762320 From epeter at openjdk.org Tue Feb 4 09:16:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 09:16:17 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 11:03:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java > > Co-authored-by: Emanuel Peter Ooops, I found a few more details. But the C++ VM changes look really good now. The Java changes I leave to @PaulSandoz src/hotspot/share/opto/convertnode.cpp line 971: > 969: return true; > 970: default: > 971: return false; Does this cover all cases? What about `FmaHF`? src/hotspot/share/opto/convertnode.hpp line 234: > 232: class ReinterpretHF2SNode : public Node { > 233: public: > 234: ReinterpretHF2SNode(Node* in1) : Node(0, in1) {} Suggestion: ReinterpretHF2SNode(Node* in1) : Node(nullptr, in1) {} src/hotspot/share/opto/divnode.cpp line 866: > 864: // Dividing by self is 1. > 865: // IF the divisor is 1, we are an identity on the dividend. > 866: Node* DivHFNode::Identity(PhaseGVN* phase) { Remove line with `isA_Copy`. src/hotspot/share/opto/type.cpp line 1106: > 1104: if (_base == FloatBot || _base == FloatTop) return FLOAT; > 1105: if (_base == HalfFloatTop || _base == HalfFloatBot) return Type::BOTTOM; > 1106: if (_base == DoubleTop || _base == DoubleBot) return Type::BOTTOM; If you are already fixing the style, you should use curly braces as I said above ;) src/hotspot/share/opto/type.cpp line 1472: > 1470: //------------------------------meet------------------------------------------- > 1471: // Compute the MEET of two types. It returns a new Type object. > 1472: const Type* TypeH::xmeet(const Type* t) const { Suggestion: //------------------------------xmeet------------------------------------------- // Compute the MEET of two types. It returns a new Type object.
const Type* TypeH::xmeet(const Type* t) const { ------------- PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2592155651 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940766035 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940763403 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940766624 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940771256 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940771662 From jbhateja at openjdk.org Tue Feb 4 10:05:09 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 10:05:09 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: > Hi All, > > This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) > > Following is the summary of changes included with this patch:- > > 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. > 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. > 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. > - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. > 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. > 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. > 7. 
Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. > 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF > 9. X86 backend implementation for all supported intrinsics. > 10. Functional and Performance validation tests. > > Kindly review the patch and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Fixing typos ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22754/files - new: https://git.openjdk.org/jdk/pull/22754/files/8207c9ff..82a42213 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=15-16 Stats: 13 lines in 3 files changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/22754.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22754/head:pull/22754 PR: https://git.openjdk.org/jdk/pull/22754 From jbhateja at openjdk.org Tue Feb 4 10:05:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 10:05:11 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 18:11:11 GMT, Jatin Bhateja wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java >> >> Co-authored-by: Emanuel Peter > > Hi @PaulSandoz , @eme64 , All outstanding comments have been addressed, please let me know if there are other comments. > @jatin-bhateja Testing is all green :green_circle:
Doing a last pass over the code. Thanks @eme64, looking forward to your approval :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2633414710 From jbhateja at openjdk.org Tue Feb 4 10:05:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 10:05:11 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 09:03:09 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java >> >> Co-authored-by: Emanuel Peter > > src/hotspot/share/opto/convertnode.cpp line 971: > >> 969: return true; >> 970: default: >> 971: return false; > > Does this cover all cases? What about `FmaHF`? FmaHF is a ternary operation and is intrinsified. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940855109 From adinn at openjdk.org Tue Feb 4 11:48:09 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 4 Feb 2025 11:48:09 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Mon, 3 Feb 2025 18:11:51 GMT, Ferenc Rakoczi wrote: >> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. > >> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. 
If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. > > @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. @ferakocz Yes, the stub declaration part of it looks to be correct. The rest of the patch will need at least two reviewers (@theRealAph? @martinuy? @franferrax) and may take some time to review, given that they will probably need to read up on the maths and algorithms. As an aid for reviewers and maintainers it would be good to insert a comment into the generator file linking the implementations to the relevant maths and algorithm. I found the FIPS-204 spec and the CRYSTALS-Dilithium Algorithm Specifications and Supporting Documentation paper, Shi Bai, Léo Ducas et al, 2021 - are they the best ones to look at?
It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. src/hotspot/share/oops/arrayKlass.hpp line 2: > 1: /* > 2: * Copyright (c) 1997, 2025, Oracle and/or its affiliates. All rights reserved. arrayKlass.hpp isn't changed, is this left over from a previous iteration? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1941185550 From alanb at openjdk.org Tue Feb 4 14:00:13 2025 From: alanb at openjdk.org (Alan Bateman) Date: Tue, 4 Feb 2025 14:00:13 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Good cleanup. 
src/java.base/share/classes/java/lang/Class.java line 244: > 242: classLoader = loader; > 243: componentType = arrayComponentType; > 244: modifiers = dummyModifiers; I realize this ctor isn't used but "dummyModifiers" looks very strange as parameter name when compared to the others, renaming it to something like "mods" would make it less confusing for anyone reading through this code. ------------- Marked as reviewed by alanb (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2592938860 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1941220263 From coleenp at openjdk.org Tue Feb 4 14:43:51 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 4 Feb 2025 14:43:51 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. 
Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Fix copyright and param name ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/8854fcc6..ff693418 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=00-01 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Tue Feb 4 14:43:51 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 4 Feb 2025 14:43:51 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:40:47 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. 
> > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name Thank you for your comments, Alan. ------------- PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2593075666 From coleenp at openjdk.org Tue Feb 4 14:43:51 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 4 Feb 2025 14:43:51 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 13:36:44 GMT, Alan Bateman wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix copyright and param name > > src/hotspot/share/oops/arrayKlass.hpp line 2: > >> 1: /* >> 2: * Copyright (c) 1997, 2025, Oracle and/or its affiliates. All rights reserved. > > arrayKlass.hpp isn't changed, is this left over from a previous iteration? yes, it was something that my copyright script thought I changed from merging some previous changes. > src/java.base/share/classes/java/lang/Class.java line 244: > >> 242: classLoader = loader; >> 243: componentType = arrayComponentType; >> 244: modifiers = dummyModifiers; > > I realize this ctor isn't used but "dummyModifiers" looks very strange as parameter name when compared to the others, renaming it to something like "mods" would make it less confusing for anyone reading through this code. I changed it to mods. Thanks for the suggestion. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1941301152 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1941302820 From never at openjdk.org Tue Feb 4 16:36:26 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 16:36:26 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash Message-ID: This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. ------------- Commit messages: - 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash Changes: https://git.openjdk.org/jdk/pull/23444/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349374 Stats: 8 lines in 1 file changed: 5 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23444.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23444/head:pull/23444 PR: https://git.openjdk.org/jdk/pull/23444 From dnsimon at openjdk.org Tue Feb 4 16:58:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Feb 2025 16:58:09 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:31:50 GMT, Tom Rodriguez wrote: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java line 179: > 177: } > 178: > 179: if (UnsafeAccess.UNSAFE.getLong(getFailedSpeculationsAddress()) != 0) { It's still possible for `getFailedSpeculationsAddress()` to return 0 (i.e. when `managesFailedSpeculations` is `false`). 
So I think this should be: diff --git a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java index fd46e281c3b..a861c00d77d 100644 --- a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java +++ b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java @@ -171,8 +171,9 @@ public String toString() { @Override public void collectFailedSpeculations() { - if (failedSpeculationsAddress != 0 && UnsafeAccess.UNSAFE.getLong(failedSpeculationsAddress) != 0) { - failedSpeculations = compilerToVM().getFailedSpeculations(failedSpeculationsAddress, failedSpeculations); + long address = getFailedSpeculationsAddress(); + if (address != 0 && UnsafeAccess.UNSAFE.getLong(address) != 0) { + failedSpeculations = compilerToVM().getFailedSpeculations(address, failedSpeculations); assert failedSpeculations.getClass() == byte[][].class; } } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941551882 From never at openjdk.org Tue Feb 4 17:41:14 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 17:41:14 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:54:58 GMT, Doug Simon wrote: >> This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. > > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java line 179: > >> 177: } >> 178: >> 179: if (UnsafeAccess.UNSAFE.getLong(getFailedSpeculationsAddress()) != 0) { > > It's still possible for `getFailedSpeculationsAddress()` to return 0 (i.e. when `managesFailedSpeculations` is `false`). 
So I think this should be: > > diff --git a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java > index fd46e281c3b..a861c00d77d 100644 > --- a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java > +++ b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java > @@ -171,8 +171,9 @@ public String toString() { > > @Override > public void collectFailedSpeculations() { > - if (failedSpeculationsAddress != 0 && UnsafeAccess.UNSAFE.getLong(failedSpeculationsAddress) != 0) { > - failedSpeculations = compilerToVM().getFailedSpeculations(failedSpeculationsAddress, failedSpeculations); > + long address = getFailedSpeculationsAddress(); > + if (address != 0 && UnsafeAccess.UNSAFE.getLong(address) != 0) { > + failedSpeculations = compilerToVM().getFailedSpeculations(address, failedSpeculations); > assert failedSpeculations.getClass() == byte[][].class; > } > } I'm filtering out 0 above this line and `getFailedSpeculationsAddress()` can't return 0 if `failedSpeculationsAddress` is already non-zero. `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941612206 From epeter at openjdk.org Tue Feb 4 18:47:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:47:25 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Thu, 30 Jan 2025 17:11:08 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. 
They can only have a control input if created inside `Parse::array_store_check()`; the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect: the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleans up `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > format Nice cleanup. Though it looks like you are doing more than remove the ctrl input. I don't know the code very well, so I have some questions ;) src/hotspot/share/opto/parseHelper.cpp line 170: > 168: !too_many_traps(Deoptimization::Reason_array_check) && > 169: !tak->klass_is_exact() && > 170: tak->isa_aryklassptr()) { Looks like an implicit `nullptr` check. Not allowed by code style ;)
PR Review: https://git.openjdk.org/jdk/pull/23274#pullrequestreview-2593742600 PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941714615 PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941719070 From epeter at openjdk.org Tue Feb 4 18:47:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:47:25 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Tue, 4 Feb 2025 18:39:32 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> format > > src/hotspot/share/opto/parseHelper.cpp line 170: > >> 168: !too_many_traps(Deoptimization::Reason_array_check) && >> 169: !tak->klass_is_exact() && >> 170: tak->isa_aryklassptr()) { > > Looks like an implicit `nullptr` check. Not allowed by code style ;) Can you quickly explain this change from `tak != TypeInstKlassPtr::OBJECT` so I don't need to investigate myself, please? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941715309 From qamai at openjdk.org Tue Feb 4 18:57:10 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Feb 2025 18:57:10 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: <067rRrzD6d7ZDU-HYPHQ-qVhPygP_3WqrrgZvikgjIc=.98110421-5c91-492a-8f35-a9544cde6189@github.com> On Tue, 4 Feb 2025 18:40:05 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/parseHelper.cpp line 170: >> >>> 168: !too_many_traps(Deoptimization::Reason_array_check) && >>> 169: !tak->klass_is_exact() && >>> 170: tak->isa_aryklassptr()) { >> >> Looks like an implicit `nullptr` check. 
Not allowed by code style ;) > > Can you quickly explain this change from `tak != TypeInstKlassPtr::OBJECT` so I don't need to investigate myself, please? > Looks like an implicit nullptr check. Not allowed by code style ;) But the verb here is `isa` and we use these as a `bool` a lot, though :/ > Can you quickly explain this change from tak != TypeInstKlassPtr::OBJECT so I don't need to investigate myself, please? The bottom type of an array can be either `Object` or an array of some kind, so `tak != TypeInstKlassPtr::OBJECT` is the same as `tak->isa_aryklassptr()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941732694 From qamai at openjdk.org Tue Feb 4 18:57:11 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Feb 2025 18:57:11 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Tue, 4 Feb 2025 18:43:04 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> format > > src/hotspot/share/opto/parseHelper.cpp line 193: > >> 191: // See issue JDK-8057622 for details. >> 192: >> 193: always_see_exact_class = true; > > Why is it ok to remove this? > If this branch is not taken, it used to be `false`, and would lead to something different below... The only use of this is to decide if we need to attach a control input to the `LoadKlass`. As the control input is not needed, this can be removed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941735400 From duke at openjdk.org Tue Feb 4 19:00:33 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Feb 2025 19:00:33 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Mon, 3 Feb 2025 18:11:51 GMT, Ferenc Rakoczi wrote: >> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. > >> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. > > @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. > @ferakocz Yes, the stub declaration part of it looks to be correct. > > The rest of the patch will need at least two reviewers (@theRealAph? @martinuy? @franferrax) and may take some time to review, given that they will probably need to read up on the maths and algorithms. As an aid for reviewers and maintainers it would be good to insert a comment into the generator file linking the implementations to the relevant maths and algorithm. I found the FIPS-204 spec and the CRYSTALS-Dilithium Algorithm Specifications and Supporting Documentation paper, Shi Bai, Léo Ducas et al, 2021 - are they the best ones to look at?
The Java implementation of ML-DSA is based on the FIPS-204 standard and the intrinsics' implementations are based on the corresponding Java methods, except that the montMul() calls in them are inlined. The rest of the transformation from Java code to intrinsic code is pretty straightforward, so a reviewer need not necessarily understand the whole mathematics of the ML-DSA algorithms, just that the Java and the corresponding intrinsic code do the same thing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2634810518 From epeter at openjdk.org Tue Feb 4 19:09:35 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:09:35 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operations are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577) for more details. >> 7.
Since Float16 uses short as its storage type, raw FP16 values are always loaded into a general purpose register, but the FP16 ISA generally operates on floating point registers, so the compiler injects reinterpretation IR before and after Float16 operation nodes to move the short value to a floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos Thanks @jatin-bhateja for all your patience, this really took a while! It looks good to me - again I'm only reviewing the C++ VM changes, so someone else has to review the Java changes. ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2593800414 From epeter at openjdk.org Tue Feb 4 19:09:36 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:09:36 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: <7WobCDj_e4Sw1CEYr3EVfgHTxJoxBfiFR63WwrzDDzs=.27e926d0-23e6-4231-a677-fdfd683083be@github.com> On Tue, 4 Feb 2025 09:56:15 GMT, Jatin Bhateja wrote: >> src/hotspot/share/opto/convertnode.cpp line 971: >> >>> 969: return true; >>> 970: default: >>> 971: return false; >> >> Does this cover all cases? What about `FmaHF`? > > FmaHF is a ternary operation and is intrinsified. Ah, right. My bad!
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1941748224 From liach at openjdk.org Tue Feb 4 19:21:44 2025 From: liach at openjdk.org (Chen Liang) Date: Tue, 4 Feb 2025 19:21:44 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: <7oq7j2pYG9ToDNcGyVWrphH_wFyvPRX2kl3qxgQYBss=.449139d7-e3a8-4587-b5ce-a5f7f9f5b613@github.com> On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operations are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577) for more details. >> 7. Since Float16 uses short as its storage type, raw FP16 values are always loaded into a general purpose register, but the FP16 ISA generally operates on floating point registers, so the compiler injects reinterpretation IR before and after Float16 operation nodes to move the short value to a floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10.
Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 42: > 40: } > 41: > 42: public interface Float16TernaryMathOp { Is there a reason we don't write the default impl explicitly in this class, but ask for a lambda for an implementation? Each intrinsified method only has one default impl, so I think we can just inline that into the method body here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1941764924 From dnsimon at openjdk.org Tue Feb 4 19:39:15 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Feb 2025 19:39:15 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:31:50 GMT, Tom Rodriguez wrote: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. Marked as reviewed by dnsimon (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23444#pullrequestreview-2593861407 From dnsimon at openjdk.org Tue Feb 4 19:39:16 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Feb 2025 19:39:16 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 17:38:40 GMT, Tom Rodriguez wrote: > `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. Ok, I'd forgotten about that invariant. Might be worth reminding the reader of it with a comment in collectFailedSpeculations. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941785813 From never at openjdk.org Tue Feb 4 20:52:37 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 20:52:37 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v2] In-Reply-To: References: Message-ID: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: improve javadoc ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23444/files - new: https://git.openjdk.org/jdk/pull/23444/files/aefc1dfd..459f5c36 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23444.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23444/head:pull/23444 PR: https://git.openjdk.org/jdk/pull/23444 From never at openjdk.org Tue Feb 4 20:52:37 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 20:52:37 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 19:36:36 GMT, Doug Simon wrote: >> I'm filtering out 0 above this line and `getFailedSpeculationsAddress()` can't return 0 if `failedSpeculationsAddress` is already non-zero. >> >> `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. > >> `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. > > Ok, I'd forgotten about that invariant. Might be worth reminding the reader of it with a comment in collectFailedSpeculations. 
It's already the style in other places like the call to addFailedSpeculation so I'm not sure it's worth calling out here. I've updated the javadoc for getFailedSpeculationsAddress to specify that it always returns non-zero. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941877712 From never at openjdk.org Tue Feb 4 20:56:53 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 20:56:53 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v3] In-Reply-To: References: Message-ID: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: improve comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23444/files - new: https://git.openjdk.org/jdk/pull/23444/files/459f5c36..5a5fd6fc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=01-02 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23444.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23444/head:pull/23444 PR: https://git.openjdk.org/jdk/pull/23444 From dlong at openjdk.org Wed Feb 5 01:13:20 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 01:13:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. 
The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 73: > 71: public int getAppArrayModifiers() { > 72: return clazzArray.getClass().getModifiers(); > 73: } I'm guessing this is the benchmark that shows an extra load. How about adding a benchmark that makes the Clazz[] final or @Stable, and see if that makes the extra load go away? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1942114565 From jbhateja at openjdk.org Wed Feb 5 07:09:15 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 5 Feb 2025 07:09:15 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: <7oq7j2pYG9ToDNcGyVWrphH_wFyvPRX2kl3qxgQYBss=.449139d7-e3a8-4587-b5ce-a5f7f9f5b613@github.com> References: <7oq7j2pYG9ToDNcGyVWrphH_wFyvPRX2kl3qxgQYBss=.449139d7-e3a8-4587-b5ce-a5f7f9f5b613@github.com> Message-ID: On Tue, 4 Feb 2025 19:18:39 GMT, Chen Liang wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing typos > > src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 42: > >> 40: } >> 41: >> 42: public interface Float16TernaryMathOp { > > Is there a reason we don't write the default impl explicitly in this class, but ask for a lambda for an implementation? Each intrinsified method only has one default impl, so I think we can just inline that into the method body here. This wrapper class is part of the java.base module and only contains intrinsic entry points for APIs defined in the Float16 class, which is part of an incubation module. Thus, exposing intrinsic fallback code through a lambda keeps the interface clean while the actual API logic and comments around it remain intact in the Float16 class. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1942344948 From epeter at openjdk.org Wed Feb 5 09:30:21 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:30:21 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: <067rRrzD6d7ZDU-HYPHQ-qVhPygP_3WqrrgZvikgjIc=.98110421-5c91-492a-8f35-a9544cde6189@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> <067rRrzD6d7ZDU-HYPHQ-qVhPygP_3WqrrgZvikgjIc=.98110421-5c91-492a-8f35-a9544cde6189@github.com> Message-ID: On Tue, 4 Feb 2025 18:52:13 GMT, Quan Anh Mai wrote: >> Can you quickly explain this change from `tak != TypeInstKlassPtr::OBJECT` so I don't need to investigate myself, please? > >> Looks like an implicit nullptr check. Not allowed by code style ;) > > But the verb here is `isa` and we use these as a `bool` a lot, though :/ > >> Can you quickly explain this change from tak != TypeInstKlassPtr::OBJECT so I don't need to investigate myself, please? > > The bottom type of an array can be either `Object` or an array of some kind, so `tak != TypeInstKlassPtr::OBJECT` is the same as `tak->isa_aryklassptr()`. Ah great, thanks for the explanation! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1942520434 From epeter at openjdk.org Wed Feb 5 09:35:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:35:17 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Thu, 30 Jan 2025 17:11:08 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. 
They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > format Looks good, thanks for the explanations! I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok. But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23274#pullrequestreview-2595110848 From epeter at openjdk.org Wed Feb 5 09:35:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:35:18 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Tue, 4 Feb 2025 18:54:21 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/parseHelper.cpp line 193: >> >>> 191: // See issue JDK-8057622 for details. >>> 192: >>> 193: always_see_exact_class = true; >> >> Why is it ok to remove this? >> If this branch is not taken, it used to be `false`, and would lead to something different below... > > The only use of this is to decide if we need to attach a control input to the `LoadKlass`. As the control input is not needed, this can be removed. Got it, thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1942528816 From adinn at openjdk.org Wed Feb 5 10:35:10 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 5 Feb 2025 10:35:10 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Tue, 4 Feb 2025 18:57:28 GMT, Ferenc Rakoczi wrote: >>> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. >> >> @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. > >> @ferakocz Yes, the stub declaration part of it looks to be correct. >> >> The rest of the patch will need at least two reviewers (@theRealAph? @martinuy? @franferrax) and may take some time to review, given that they will probably need to read up on the maths and algorithms. As an aid for reviewers and maintainers it would be good to insert a comment into the generator file linking the implementations to the relevant maths and algorithm. I found the FIPS-204 spec and the CRYSTALS-Dilithium Algorithm Specifications and Supporting Documentation paper, Shi Bai, Léo Ducas et al, 2021 - are they the best ones to look at? > > The Java implementation of ML-DSA is based on the FIPS-204 standard and the intrinsics' implementations are based on the corresponding Java methods, except that the montMul() calls in them are inlined. 
The rest of the transformation from Java code to intrinsic code is pretty straightforward, so a reviewer need not necessarily understand the whole mathematics of the ML-DSA algorithms, just that the Java and the corresponding intrinsic code do the same thing. @ferakocz > The Java implementation of ML-DSA is based on the FIPS-204 standard and the intrinsics' implementations are based on the corresponding Java methods, except that the montMul() calls in them are inlined. The rest of the transformation from Java code to intrinsic code is pretty straightforward, so a reviewer need not necessarily understand the whole mathematics of the ML-DSA algorithms, just that the Java and the corresponding intrinsic code do the same thing. Yes, I located the relevant Java implementations in SHA3.java (keccak) and ML_DSA.java (dilithiumXXX) plus also SHA3Parallel.java (doubleKeccak). The first file does at least mention FIPS-202. The second does not include any reference, in particular does not mention FIPS-204. I still think it would be helpful for reviewers and maintainers if you were to add a comment in front of the generator routines that 1) notes that these routines are based on the relevant Java sources and 2) mentions that the Java code is in turn based on the FIPS-202 and FIPS-204 standards. While I agree that a reviewer or maintainer could simply check the generated code against the Java code I believe access to the underlying theory will be of aid when it comes to understanding what each variant is doing and verifying the equivalence of the two. That's why I'd also prefer to have two reviews to be sure that more than one of us who may be tasked with maintaining this code can be happy that we understand, at least, the equivalence in question. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2636346476 From coleenp at openjdk.org Wed Feb 5 19:51:06 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 5 Feb 2025 19:51:06 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> On Wed, 5 Feb 2025 01:10:39 GMT, Dean Long wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix copyright and param name > > test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 73: > >> 71: public int getAppArrayModifiers() { >> 72: return clazzArray.getClass().getModifiers(); >> 73: } > > I'm guessing this is the benchmark that shows an extra load. How about adding a benchmark that makes the Clazz[] final or @Stable, and see if that makes the extra load go away? Name Cnt Base Error Test Error Unit Change getAppArrayModifiers 30 0.923 ± 0.004 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) getAppArrayModifiersFinal 30 0.922 ± 0.000 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) No it doesn't really help. There's still an extra load. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943569183 From kvn at openjdk.org Wed Feb 5 20:15:10 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Feb 2025 20:15:10 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v3] In-Reply-To: References: Message-ID: <1GBWBQfWNIwLEF26VW0tecseBegwuuRUDG-rNg1zdoU=.63aa4380-5c75-45ac-86dd-9c9fe308b9dc@github.com> On Tue, 4 Feb 2025 20:56:53 GMT, Tom Rodriguez wrote: >> This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. 
> > Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: > > improve comments Seems fine. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23444#pullrequestreview-2596877352 From dlong at openjdk.org Wed Feb 5 20:26:16 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 20:26:16 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> Message-ID: <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> On Wed, 5 Feb 2025 19:42:02 GMT, Coleen Phillimore wrote: >> test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 73: >> >>> 71: public int getAppArrayModifiers() { >>> 72: return clazzArray.getClass().getModifiers(); >>> 73: } >> >> I'm guessing this is the benchmark that shows an extra load. How about adding a benchmark that makes the Clazz[] final or @Stable, and see if that makes the extra load go away? > > Name Cnt Base Error Test Error Unit Change > getAppArrayModifiers 30 0.923 ± 0.004 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) > getAppArrayModifiersFinal 30 0.922 ± 0.000 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) > > No it doesn't really help. There's still an extra load. OK, if the extra load turns out to be a problem in the future, we could look into why the compilers are generating the load when the Class is known/constant. If the old intrinsic was able to pull the constant out of the Klass, then surely we can do the same and pull the value from the Class field. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943616021 From dlong at openjdk.org Wed Feb 5 21:29:12 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 21:29:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name src/hotspot/share/compiler/compileLog.cpp line 116: > 114: print(" unloaded='1'"); > 115: } else { > 116: print(" flags='%d'", klass->access_flags()); There may be tools that parse the log file and get confused by this change. Maybe we should also change the label from "flags" to "access flags". 
src/hotspot/share/jfr/recorder/checkpoint/types/jfrTypeSet.cpp line 350: > 348: writer->write(mark_symbol(klass, leakp)); > 349: writer->write(package_id(klass, leakp)); > 350: writer->write(klass->compute_modifier_flags()); Isn't this much more expensive than grabbing the value from the mirror, especially if we have to iterate over inner classes? src/hotspot/share/oops/instanceKlass.hpp line 1128: > 1126: #endif > 1127: > 1128: int compute_modifier_flags() const; I don't see why this can't stay u2. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943680670 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943679056 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943682936 From dlong at openjdk.org Wed Feb 5 21:43:14 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 21:43:14 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. 
> > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name src/hotspot/share/opto/memnode.cpp line 2458: > 2456: return TypePtr::NULL_PTR; > 2457: } > 2458: // ??? I suspect that we still need this code to support intrinsics like LibraryCallKit::inline_native_classID() and maybe other users of this field, but the comment below no longer makes sense. src/hotspot/share/opto/memnode.cpp line 2459: > 2457: } > 2458: // ??? > 2459: // (Folds up the 1st indirection in aClassConstant.getModifiers().) Suggestion: // Fold up the load of the hidden field ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943695585 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943696867 From dlong at openjdk.org Wed Feb 5 21:47:12 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 21:47:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> Message-ID: On Thu, 12 Dec 2024 10:16:01 GMT, Viktor Klang wrote: >> @viktorklang-ora `@Stable` is not about how the field was set, but about the JIT observing a non-default value at compile time. If it observes a non-default value, it can treat it as a compile time constant. > > @DanHeidinga Great explanation, thank you! If Class had other fields smaller than `int`, would we consider making this something like `char` to save space (allowing all the sub-word fields to be compacted)? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943701237 From dlong at openjdk.org Wed Feb 5 21:53:11 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 21:53:11 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name Overall looks good to me. Please ask @iwanowww to review compiler changes. ------------- Marked as reviewed by dlong (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2597046622 From liach at openjdk.org Thu Feb 6 04:40:11 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 6 Feb 2025 04:40:11 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> Message-ID: <7KdNVSXLx0N027uyQgtUuN82VpXTlyPpPOnBv3sqYRs=.6b549b56-36f9-4ab3-8469-4779d93dd1e7@github.com> On Wed, 5 Feb 2025 20:23:05 GMT, Dean Long wrote: >> Name Cnt Base Error Test Error Unit Change >> getAppArrayModifiers 30 0.923 ± 0.004 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) >> getAppArrayModifiersFinal 30 0.922 ± 0.000 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) >> >> No it doesn't really help. There's still an extra load. > > OK, if the extra load turns out to be a problem in the future, we could look into why the compilers are generating the load when the Class is known/constant. If the old intrinsic was able to pull the constant out of the Klass, then surely we can do the same and pull the value from the Class field. Does `static final` help here? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944083490 From coleenp at openjdk.org Thu Feb 6 12:11:14 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 12:11:14 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> Message-ID: <4aAX8rSEcvkeYteaJUXHfVEzBbNGwGlhDLIz548dFcs=.616fa7dd-d5bf-42d5-aca0-0bea0b5591d0@github.com> On Wed, 5 Feb 2025 21:44:37 GMT, Dean Long wrote: >> @DanHeidinga Great explanation, thank you! > > If Class had other fields smaller than `int`, would be consider making this something like `char` to save space (allowing all the sub-word fields to be compacted)? I thought of doing this since I made modifiers u2 in the Hotspot code just previously, but all the Java code refers to this as an int. And I didn't see other fields to compact it with. Maybe if access_flags are moved we could make them both char (not short since they're unsigned). It feels weird to not have unsigned short to my C++ eyes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944613105 From coleenp at openjdk.org Thu Feb 6 13:13:29 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:13:29 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v3] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. 
The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/memnode.cpp Co-authored-by: Dean Long <17332032+dean-long at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/ff693418..f92620eb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Thu Feb 6 13:13:29 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:13:29 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> Message-ID: On Wed, 5 Feb 2025 21:24:25 GMT, Dean Long wrote: >> Coleen Phillimore has updated the pull 
request incrementally with one additional commit since the last revision: >> >> Fix copyright and param name > > src/hotspot/share/compiler/compileLog.cpp line 116: > >> 114: print(" unloaded='1'"); >> 115: } else { >> 116: print(" flags='%d'", klass->access_flags()); > > There may be tools that parse the log file and get confused by this change. Maybe we should also change the label from "flags" to "access flags". Okay, I wanted to remove the one use of ciKlass::modifier_flags() and the field with this change, but I'll add it back since I added a Klass::modifier_flags() function. > src/hotspot/share/jfr/recorder/checkpoint/types/jfrTypeSet.cpp line 350: > >> 348: writer->write(mark_symbol(klass, leakp)); >> 349: writer->write(package_id(klass, leakp)); >> 350: writer->write(klass->compute_modifier_flags()); > > Isn't this much more expensive than grabbing the value from the mirror, especially if we have to iterate over inner classes? I was trying not to add a Klass::modifier_flags function, but now I have. > src/hotspot/share/opto/memnode.cpp line 2458: > >> 2456: return TypePtr::NULL_PTR; >> 2457: } >> 2458: // ??? > > I suspect that we still need this code to support intrinsics like LibraryCallKit::inline_native_classID() and maybe other users of this field, but the comment below no longer makes sense. Thank you for noticing the ??? that I left in and the comment. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944651499 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944640356 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944697467 From coleenp at openjdk.org Thu Feb 6 13:13:29 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:13:29 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name Thank you for the detailed comments. 
------------- PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2598534835 From coleenp at openjdk.org Thu Feb 6 13:13:30 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:13:30 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <7KdNVSXLx0N027uyQgtUuN82VpXTlyPpPOnBv3sqYRs=.6b549b56-36f9-4ab3-8469-4779d93dd1e7@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> <7KdNVSXLx0N027uyQgtUuN82VpXTlyPpPOnBv3sqYRs=.6b549b56-36f9-4ab3-8469-4779d93dd1e7@github.com> Message-ID: On Thu, 6 Feb 2025 04:37:17 GMT, Chen Liang wrote: >> OK, if the extra load turns out to be a problem in the future, we could look into why the compilers are generating the load when the Class is known/constant. If the old intrinsic was able to pull the constant out of the Klass, then surely we can do the same and pull the value from the Class field. > > Does `static final` help here? Yes. Yes it does. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944694824 From coleenp at openjdk.org Thu Feb 6 13:23:54 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:23:54 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v4] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <4ruwzJXM3Jgy0rbobE3PPNAH4k8c10_4zAi6mCmc4Lw=.ccf7c825-4ffc-49fb-bc42-3c0168c1dcf8@github.com> > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. 
The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Add Klass::modifier_flags to look in the mirror, restore ciKlass::modifier_flags, add benchmark. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/f92620eb..85026362 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=02-03 Stats: 28 lines in 7 files changed: 26 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Thu Feb 6 14:31:28 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 14:31:28 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v5] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. 
The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Make compute_modifiers return u2. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/85026362..146e2551 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=03-04 Stats: 7 lines in 7 files changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Thu Feb 6 14:31:29 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 14:31:29 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> Message-ID: <9iIj0xWClD_H4U0MiEUrQGqeIgjyFdC4tuN0sAP9kUo=.1c11d464-4380-4954-9e9f-c40872acff24@github.com> On Wed, 5 Feb 2025 21:26:29 GMT, Dean Long 
wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix copyright and param name > > src/hotspot/share/oops/instanceKlass.hpp line 1128: > >> 1126: #endif >> 1127: >> 1128: int compute_modifier_flags() const; > > I don't see why this can't stay u2. I had some compilation error for the conversion that has since disappeared into the ether with u2, so I've restored them to u2. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944825437 From liach at openjdk.org Thu Feb 6 16:20:21 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 6 Feb 2025 16:20:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v5] In-Reply-To: <4aAX8rSEcvkeYteaJUXHfVEzBbNGwGlhDLIz548dFcs=.616fa7dd-d5bf-42d5-aca0-0bea0b5591d0@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> <4aAX8rSEcvkeYteaJUXHfVEzBbNGwGlhDLIz548dFcs=.616fa7dd-d5bf-42d5-aca0-0bea0b5591d0@github.com> Message-ID: On Thu, 6 Feb 2025 12:08:59 GMT, Coleen Phillimore wrote: >> If Class had other fields smaller than `int`, would we consider making this something like `char` to save space (allowing all the sub-word fields to be compacted)? > > I thought of doing this since I made modifiers u2 in the Hotspot code just previously, but all the Java code refers to this as an int. And I didn't see other fields to compact it with. Maybe if access_flags are moved we could make them both char (not short since they're unsigned). It feels weird to not have unsigned short to my C++ eyes. From a Java perspective, using `char` for the field is completely fine; this field is only accessed via `getModifiers` and not set by Java code, so the automatic widening conversion can handle it all.
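The widening behaviour discussed above can be sketched in a few lines. This is a minimal illustration, not the actual java.lang.Class implementation (the class and field names here are made up for the example): a `char` field is u2-sized and unsigned, and the implicit char-to-int widening in the getter is lossless for the 16-bit modifier-flag range.

```java
// Illustrative sketch only: stores 16-bit modifier flags in a `char` field
// and returns them as int via the automatic widening conversion.
public class ModifiersSketch {
    private final char modifiers; // u2-sized storage, unsigned by definition

    ModifiersSketch(int mods) {
        this.modifiers = (char) mods; // in the real mirror the VM writes this field
    }

    public int getModifiers() {
        return modifiers; // implicit char -> int widening, always in [0, 0xFFFF]
    }

    public static void main(String[] args) {
        ModifiersSketch s = new ModifiersSketch(0x0011); // ACC_PUBLIC | ACC_FINAL
        System.out.println(s.getModifiers()); // prints 17
    }
}
```

Since Java code only reads the field through the getter and never stores into it, no narrowing casts leak into callers.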
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1945021458 From duke at openjdk.org Thu Feb 6 18:47:54 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 6 Feb 2025 18:47:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Adding comments + some code reorganization ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23300/files - new: https://git.openjdk.org/jdk/pull/23300/files/9f7c4a23..9a3a9444 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=03-04 Stats: 447 lines in 3 files changed: 140 ins; 247 del; 60 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From qamai at openjdk.org Thu Feb 6 19:11:58 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Feb 2025 19:11:58 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: Message-ID: > Hi, > > This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: > > // We are allowed to use the constant type only if cast succeeded > > But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. > > Please take a look and leave your reviews, thanks a lot. 
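For context on the `Parse::array_store_check()` path mentioned above, this is the bytecode-level situation it compiles: a store into a covariant array must be checked against the array's runtime element klass, which is what the klass load feeds. A minimal standalone sketch (not taken from the JDK sources):

```java
// Illustration of the runtime check behind Parse::array_store_check():
// the static type of `arr` is Object[], but its runtime element type is
// String, so a non-String store must fail with ArrayStoreException.
public class ArrayStoreCheckSketch {
    public static void main(String[] args) {
        Object[] arr = new String[1]; // covariant: String[] viewed as Object[]
        arr[0] = "ok";                // passes the element-type check
        try {
            arr[0] = Integer.valueOf(42); // fails the check at runtime
        } catch (ArrayStoreException e) {
            System.out.println("ArrayStoreException");
        }
    }
}
```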
Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'master' into loadklassctrl - format - clearer intention, revert formatting, add assert - remove always_see_exact_class - remove control input of LoadKlassNode ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23274/files - new: https://git.openjdk.org/jdk/pull/23274/files/175232a6..7c2b595b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23274&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23274&range=03-04 Stats: 34650 lines in 1350 files changed: 16246 ins; 10055 del; 8349 mod Patch: https://git.openjdk.org/jdk/pull/23274.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23274/head:pull/23274 PR: https://git.openjdk.org/jdk/pull/23274 From qamai at openjdk.org Thu Feb 6 19:15:13 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Feb 2025 19:15:13 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> On Wed, 5 Feb 2025 09:32:27 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - Merge branch 'master' into loadklassctrl >> - format >> - clearer intention, revert formatting, add assert >> - remove always_see_exact_class >> - remove control input of LoadKlassNode > > Looks good, thanks for the explanations! 
> > I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok. > > But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) @eme64 I have merged the change with master, could you help me initiate the testing process, please? Thanks very much. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2640769617 From vlivanov at openjdk.org Thu Feb 6 21:15:12 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 6 Feb 2025 21:15:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> <7KdNVSXLx0N027uyQgtUuN82VpXTlyPpPOnBv3sqYRs=.6b549b56-36f9-4ab3-8469-4779d93dd1e7@github.com> Message-ID: On Thu, 6 Feb 2025 13:08:31 GMT, Coleen Phillimore wrote: >> Does `static final` help here? > > Yes. Yes it does. Cases when a class mirror is a compile-time constant are already well-optimized. Non-constant cases are the ones where missing optimization opportunities arise. In this particular case, C2 doesn't benefit from the observation that `Clazz[]` is a leaf type at runtime (no subclasses). Hence, a value loaded from a field typed as `Clazz[]` has exactly the same type, and `clazzArray.getClass()` can be constant-folded to `Clazz[].class`. Rather than a common case, it feels more like a corner case, so it is worth addressing as a follow-up enhancement. Another scenario is a meet of 2 primitive array types (which ends up as `bottom[]` in the C2 type system), but I believe it hasn't been optimized before.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1945451909 From vlivanov at openjdk.org Thu Feb 6 21:21:18 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 6 Feb 2025 21:21:18 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v5] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Thu, 6 Feb 2025 14:31:28 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Make compute_modifiers return u2. Looks good. (Except a left-over `???` in a comment.) I very much like this cleanup. Migrating from Klass to Class simplifies compiler logic since there's no need to care about primitives at runtime anymore. Speaking of missing optimization opportunities (demonstrated by one microbenchmark), it looks like a corner case and can be addressed later. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2599983789 From coleenp at openjdk.org Thu Feb 6 23:26:31 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 23:26:31 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v6] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with two additional commits since the last revision: - Remove ??? in the code. - Hide Class.modifiers field. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/146e2551..304a17ee Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=04-05 Stats: 6 lines in 3 files changed: 1 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Thu Feb 6 23:26:31 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 23:26:31 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v5] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Thu, 6 Feb 2025 14:31:28 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Make compute_modifiers return u2. Thank you Vladimir for encouraging me to continue this change. I removed the ??? 
and hid the modifiers field for reflection as suggested in this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2641339406 From epeter at openjdk.org Fri Feb 7 07:07:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 07:07:14 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 19:11:58 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into loadklassctrl > - format > - clearer intention, revert formatting, add assert > - remove always_see_exact_class > - remove control input of LoadKlassNode Marked as reviewed by epeter (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23274#pullrequestreview-2600950916 From epeter at openjdk.org Fri Feb 7 07:07:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 07:07:15 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> Message-ID: On Thu, 6 Feb 2025 19:12:26 GMT, Quan Anh Mai wrote: >> Looks good, thanks for the explanations! >> >> I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok. >> >> But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) > > @eme64 I have merged the change with master, could you help me initiate the testing process, please? Thanks very much. @merykitty Testing launched! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2642097456 From galder at openjdk.org Fri Feb 7 12:31:11 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Feb 2025 12:31:11 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Fri, 17 Jan 2025 17:53:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. 
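The kind of loop this intrinsic targets can be sketched as follows. This is only an approximation in the spirit of the `longClippingRange` benchmark discussed in this thread, not the benchmark source itself: the ternary control flow inside `Math.max`/`Math.min` is what previously blocked SuperWord, and with MinL/MaxL nodes the loop body becomes straight-line code.

```java
// Sketch (assumed shape, not the actual benchmark code): clamp each
// element of a long[] into [lo, hi] using Math.max and Math.min, the
// pattern that benefits from the MinL/MaxL intrinsics.
public class ClippingSketch {
    static void clip(long[] in, long[] out, long lo, long hi) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.min(Math.max(in[i], lo), hi); // clamp into [lo, hi]
        }
    }

    public static void main(String[] args) {
        long[] in = {-5L, 3L, 99L};
        long[] out = new long[in.length];
        clip(in, out, 0L, 10L);
        System.out.println(java.util.Arrays.toString(out)); // [0, 3, 10]
    }
}
```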
>> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo @eastig is helping with the results on aarch64, so I will verify the numbers in same way done below for x86_64 once he provides me with the results. Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly). First I will go through the results of `MinMaxVector`. This benchmark computes throughput by default so the higher the number the better. # MinMaxVector AVX-512 Following are results with AVX-512 instructions: Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 834.127 3688.961 ops/ms MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 1147.010 3687.721 ops/ms MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 1126.718 1072.812 ops/ms MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 1070.921 1070.538 ops/ms MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 510.483 1073.081 ops/ms MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 935.658 1016.910 ops/ms MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1007.410 933.774 ops/ms MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.582 1017.337 ops/ms MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 967.288 966.945 ops/ms MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 967.327 967.382 ops/ms MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 849.689 967.327 ops/ms MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 966.323 967.275 ops/ms MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 967.340 967.228 ops/ms MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 880.921 967.233 ops/ms ### `longReduction[Min|Max]` 
performance improves slightly when probability is 100 Without the patch the code uses compare instructions: 7.83% ???? ???? ? 0x00007f4f700fb305: imulq $0xb, 0x20(%r14, %r8, 8), %rdi ???? ???? ? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ???? ???? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ???? ???? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 5.64% ???? ???? ? 0x00007f4f700fb30b: cmpq %rdi, %rdx ????????? ? 0x00007f4f700fb30e: jge 0x7f4f700fb32c ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ????????? ? ; - java.lang.Math::max at 11 (line 2037) ????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 12.82% ?????????? ? 0x00007f4f700fb310: imulq $0xb, 0x28(%r14, %r8, 8), %rbp ?????????? ? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ?????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ?????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 7.46% ?????????? ? 0x00007f4f700fb316: cmpq %rbp, %rdi ?????????? ? 0x00007f4f700fb319: jl 0x7f4f700fb2e0 ;*iflt {reexecute=0 rethrow=0 return_oop=0} ????? ???? ? ; - java.lang.Math::max at 3 (line 2037) ????? ???? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ????? ???? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) And with the patch these become vectorized: ? ?? ????? 0x00007f56280fad10: vpmullq 0xf0(%rdx, %rsi, 8), %ymm10, %ymm4 8.35% ? ?? ????? 0x00007f56280fad1b: vpmullq 0xd0(%rdx, %rsi, 8), %ymm10, %ymm5 4.27% ? ?? ????? 
0x00007f56280fad26: vpmullq 0x10(%rdx, %rsi, 8), %ymm10, %ymm6 ? ?? ????? ; {no_reloc} 4.22% ? ?? ????? 0x00007f56280fad31: vpmullq 0x30(%rdx, %rsi, 8), %ymm10, %ymm7 4.00% ? ?? ????? 0x00007f56280fad3c: vpmullq 0xb0(%rdx, %rsi, 8), %ymm10, %ymm8 4.13% ? ?? ????? 0x00007f56280fad47: vpmullq 0x50(%rdx, %rsi, 8), %ymm10, %ymm11 4.10% ? ?? ????? 0x00007f56280fad52: vpmullq 0x70(%rdx, %rsi, 8), %ymm10, %ymm12 4.13% ? ?? ????? 0x00007f56280fad5d: vpmullq 0x90(%rdx, %rsi, 8), %ymm10, %ymm13 4.03% ? ?? ????? 0x00007f56280fad68: vpmaxsq %ymm6, %ymm3, %ymm3 ? ?? ????? 0x00007f56280fad6e: vpmaxsq %ymm7, %ymm3, %ymm3 4.72% ? ?? ????? 0x00007f56280fad74: vpmaxsq %ymm11, %ymm3, %ymm3 ? ?? ????? 0x00007f56280fad7a: vpmaxsq %ymm12, %ymm3, %ymm3 8.40% ? ?? ????? 0x00007f56280fad80: vpmaxsq %ymm13, %ymm3, %ymm3 23.11% ? ?? ????? 0x00007f56280fad86: vpmaxsq %ymm8, %ymm3, %ymm3 2.15% ? ?? ????? 0x00007f56280fad8c: vpmaxsq %ymm5, %ymm3, %ymm3 8.79% ? ?? ????? 0x00007f56280fad92: vpmaxsq %ymm4, %ymm3, %ymm3 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ? ?? ????? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ? ?? ????? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) ### `longLoop[Min|Max]` performance improves considerably when probability is 100 Without the patch the code uses compare + move instructions: 4.53% ???? ?? ? ? 0x00007f96b40faf33: movq 0x18(%rax, %rsi, 8), %r13;*laload {reexecute=0 rethrow=0 return_oop=0} ???? ?? ? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 20 (line 236) ???? ?? ? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124) 2.69% ???? ?? ? ? 0x00007f96b40faf38: cmpq %r11, %r13 ????? ?? ? ? 0x00007f96b40faf3b: jl 0x7f96b40faf67 ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ????? ?? ? ? ; - java.lang.Math::max at 11 (line 2037) ????? ?? ? ? 
; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 27 (line 236) ????? ?? ? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124) 8.75% ????? ??? ? ? 0x00007f96b40faf3d: movq %r13, 0x18(%rbp, %rsi, 8);*lastore {reexecute=0 rethrow=0 return_oop=0} ????? ??? ? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 236) ????? ??? ? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124) And with the patch those become vectorized: 3.55% ? ?? 0x00007f13c80fa18a: vmovdqu 0xf0(%rbx, %r10, 8), %ymm5 ? ?? 0x00007f13c80fa194: vmovdqu 0xf0(%rdi, %r10, 8), %ymm6 2.35% ? ?? 0x00007f13c80fa19e: vpmaxsq %ymm6, %ymm5, %ymm5 5.03% ? ?? 0x00007f13c80fa1a4: vmovdqu %ymm5, 0xf0(%rax, %r10, 8) ? ?? ;*lastore {reexecute=0 rethrow=0 return_oop=0} ? ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 236) ? ?? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124) It's interesting to observe that at probabilites of 50/80% the baseline performs better than at 100%. The reason for that is because at 50/80% the baseline already vectorizes. So, why isn't the baseline vectorizing at 100% probability? VLoop::check_preconditions Loop: N1256/N463 limit_check counted [int,int),+4 (3161 iters) main rc has_sfpt strip_mined 1256 CountedLoop === 1256 598 463 [[ 1256 1257 1271 1272 ]] inner stride: 4 main of N1256 strip mined !orig=[1126],[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) VLoop::check_preconditions: fails because of control flow. 
cl_exit 594 594 CountedLoopEnd === 415 593 [[ 1275 463 ]] [lt] P=0.999684, C=707717.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) cl_exit->in(0) 415 415 Region === 415 411 412 [[ 415 594 416 451 ]] !orig=[423] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) lpt->_head 1256 1256 CountedLoop === 1256 598 463 [[ 1256 1257 1271 1272 ]] inner stride: 4 main of N1256 strip mined !orig=[1126],[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) Loop: N1256/N463 limit_check counted [int,int),+4 (3161 iters) main rc has_sfpt strip_mined VLoop::check_preconditions: failed: control flow in loop not allowed At 100% probability baseline fails to vectorize because it observes a control flow. This control flow is not the one you see in min/max implementations, but this is one added by HotSpot as a result of the JIT profiling. It observes that one branch is always taken so it optimizes for that, and adds a branch for the uncommon case where the branch is not taken. ### `longClippingRange` performance improves considerably Without the patch the code uses compare + move instructions: 3.39% ?? ? ?? ? 0x00007febb40fb175: cmpq %rbp, %rcx ?? ?? ?? ? 0x00007febb40fb178: jge 0x7febb40fb17d ;*iflt {reexecute=0 rethrow=0 return_oop=0} ?? ?? ?? ? ; - java.lang.Math::max at 3 (line 2037) ?? ?? ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 25 (line 220) ?? ?? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) 2.69% ?? ?? ?? ? 0x00007febb40fb17a: movq %rbp, %rcx ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ?? ?? ?? ? ; - java.lang.Math::max at 11 (line 2037) ?? ?? ?? 
? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 25 (line 220) ?? ?? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) 4.35% ?? ?? ?? ? 0x00007febb40fb17d: nop 2.93% ?? ? ?? ? 0x00007febb40fb180: cmpq %r8, %rcx ?? ? ? ?? ? 0x00007febb40fb183: jle 0x7febb40fb188 ;*ifgt {reexecute=0 rethrow=0 return_oop=0} ?? ? ? ?? ? ; - java.lang.Math::min at 3 (line 2132) ?? ? ? ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 32 (line 220) ?? ? ? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) 3.51% ?? ? ? ?? ? 0x00007febb40fb185: movq %r8, %rcx ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ?? ? ? ?? ? ; - java.lang.Math::min at 11 (line 2132) ?? ? ? ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 32 (line 220) ?? ? ? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) 4.26% ?? ? ? ?? ? 0x00007febb40fb188: movq %rcx, 0x10(%rsi, %r9, 8);*lastore {reexecute=0 rethrow=0 return_oop=0} ?? ? ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220) ?? ? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) With the patch these become vectorized: 0.20% ??? ? 0x00007f10180fd15c: vmovdqu 0x10(%r11, %rcx, 8), %ymm6 ??? ? 0x00007f10180fd163: vpmaxsq %ymm6, %ymm7, %ymm6 ??? ? 0x00007f10180fd169: vpminsq %ymm8, %ymm6, %ymm6 ??? ? 0x00007f10180fd16f: vmovdqu %ymm6, 0x10(%r8, %rcx, 8);*lastore {reexecute=0 rethrow=0 return_oop=0} ??? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220) ??? ? 
; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)

# `MinMaxVector` AVX2

Following are the results on the same machine as above, but forcing AVX2 to be used instead of AVX-512:

Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units
MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 832.132 1813.609 ops/ms
MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 832.546 1814.477 ops/ms
MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 938.372 939.313 ops/ms
MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 934.964 945.124 ops/ms
MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 512.076 937.287 ops/ms
MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 999.455 689.750 ops/ms
MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1000.352 876.326 ops/ms
MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.359 999.475 ops/ms
MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 409.413 409.363 ops/ms
MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 409.374 409.141 ops/ms
MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 883.614 409.318 ops/ms
MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 404.723 404.705 ops/ms
MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 404.755 404.748 ops/ms
MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 848.784 404.669 ops/ms

### `longClippingRange` performance improves considerably

Baseline uses compare + move instructions as shown above. The patched version improves in spite of not being able to use AVX-512 instructions such as `vpmaxsq`; the improvement comes instead from vectorized compare (`vpcmpgtq`) + blend (`vblendvpd`) instructions:

? ? ???? 0x00007f9aa40f94ac: vpcmpgtq %ymm6, %ymm7, %ymm12
3.79% ? ? ???? 0x00007f9aa40f94b1: vblendvpd %ymm12, %ymm7, %ymm6, %ymm12
3.72% ? ? ???? 0x00007f9aa40f94b7: vpcmpgtq %ymm8, %ymm12, %ymm10
? ? ???? 0x00007f9aa40f94bc: vblendvpd %ymm10, %ymm8, %ymm12, %ymm10
3.78% ? ? ????
0x00007f9aa40f94c2: vmovdqu %ymm10, 0xf0(%r8, %rcx, 8) ? ? ???? ;*lastore {reexecute=0 rethrow=0 return_oop=0} ? ? ???? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220) ? ? ???? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) ### `longReduction[Min|Max]` performance drops considerably when probability is 100 Baseline uses compare + move instruction to implement this: ???? ???? ? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ???? ???? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ???? ???? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 6.30% ???? ???? ? 0x00007fd5580f678b: cmpq %rdi, %rdx ????????? ? 0x00007fd5580f678e: jge 0x7fd5580f67ac ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ????????? ? ; - java.lang.Math::max at 11 (line 2037) ????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 12.88% ?????????? ? 0x00007fd5580f6790: imulq $0xb, 0x28(%r14, %r8, 8), %rbp ?????????? ? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ?????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ?????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 7.55% ?????????? ? 0x00007fd5580f6796: cmpq %rbp, %rdi ?????????? ? 0x00007fd5580f6799: jl 0x7fd5580f6760 ;*iflt {reexecute=0 rethrow=0 return_oop=0} ????? ???? ? ; - java.lang.Math::max at 3 (line 2037) ????? ???? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ????? ???? ? 
; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) With the patch the code uses conditional moves instead: 0.05% ?? 0x00007fc4700f5253: imulq $0xb, 0x28(%r14, %r11, 8), %rdx 10.62% ?? 0x00007fc4700f5259: imulq $0xb, 0x20(%r14, %r11, 8), %rax 0.63% ?? 0x00007fc4700f525f: imulq $0xb, 0x10(%r14, %r11, 8), %r8 ?? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ?? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 10.34% ?? 0x00007fc4700f5265: cmpq %r8, %r13 2.37% ?? 0x00007fc4700f5268: cmovlq %r8, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ?? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 1.15% ?? 0x00007fc4700f526c: imulq $0xb, 0x18(%r14, %r11, 8), %r8 ?? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ?? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 9.28% ?? 0x00007fc4700f5272: cmpq %r8, %r13 3.82% ?? 0x00007fc4700f5275: cmovlq %r8, %r13 21.61% ?? 0x00007fc4700f5279: cmpq %rax, %r13 11.55% ?? 0x00007fc4700f527c: cmovlq %rax, %r13 4.48% ?? 0x00007fc4700f5280: cmpq %rdx, %r13 11.76% ?? 0x00007fc4700f5283: cmovlq %rdx, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ?? 
; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)

When one of the branches is taken always or almost always, the branchy baseline code can be optimized with branch prediction. However, the conditional move instructions force the CPU to compute both sides of the branch, so it performs worse in this scenario.

Why are vectorized instructions not used in this scenario? Vector instructions for min/max are not available with AVX2, and the vectorization trace shows it:

PackSet::print: 3 packs
Pack: 0
0: 1119 LoadL === 1105 343 1120 [[ 1117 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=997,663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1112 LoadL === 1105 343 1113 [[ 1111 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 997 LoadL === 1105 343 998 [[ 996 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 663 LoadL === 1105 343 455 [[ 458 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
Pack: 1
0: 1117 MulL === _ 1119 162 [[ 1116 ]] !orig=996,458
!jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 1: 1111 MulL === _ 1112 162 [[ 1110 ]] !orig=458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 2: 996 MulL === _ 997 162 [[ 995 ]] !orig=458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 3: 458 MulL === _ 663 162 [[ 459 ]] !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) Pack: 2 0: 1116 MaxL === _ 1128 1117 [[ 1110 ]] !orig=995,459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 1: 1110 MaxL === _ 1116 1111 [[ 995 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 2: 995 MaxL === _ 1110 996 [[ 459 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 3: 459 MaxL === _ 995 458 [[ 1128 923 570 ]] !orig=1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) WARNING: Removed pack: not implemented at any smaller size: 0: 1116 MaxL === _ 1128 1117 [[ 1110 ]] !orig=995,459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 1: 1110 MaxL === _ 1116 1111 [[ 995 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ 
bci:19 (line 124)
2: 995 MaxL === _ 1110 996 [[ 459 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 459 MaxL === _ 995 458 [[ 1128 923 570 ]] !orig=1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
After SuperWord::split_packs_only_implemented_with_smaller_size

One interesting option to explore here would be whether MaxL/MinL could be implemented in terms of vectorized compare instructions, as shown above in the `longClippingRange` scenario. Thoughts @rwestrel @eme64?

# `VectorReduction2.WithSuperword` on AVX-512 machine

As requested by Emanuel, I've also run this benchmark. Note that the results here are time per op, so lower numbers are better:

Benchmark (SIZE) (seed) Mode Cnt Baseline Patch Units
VectorReduction2.WithSuperword.longMaxBig 2048 0 avgt 3 3970.527 1918.821 ns/op
VectorReduction2.WithSuperword.longMaxDotProduct 2048 0 avgt 3 1369.634 1055.762 ns/op
VectorReduction2.WithSuperword.longMaxSimple 2048 0 avgt 3 722.314 2172.064 ns/op
VectorReduction2.WithSuperword.longMinBig 2048 0 avgt 3 3996.694 1918.398 ns/op
VectorReduction2.WithSuperword.longMinDotProduct 2048 0 avgt 3 1363.687 1056.375 ns/op
VectorReduction2.WithSuperword.longMinSimple 2048 0 avgt 3 718.150 2179.478 ns/op

`long[Min|Max]Big` and `long[Min|Max]DotProduct` benchmarks show considerable improvements, but something odd is happening in `long[Min|Max]Simple`.

### `long[Min|Max]Simple` performance drops considerably

Baseline uses compare + move instructions:

8.05% ?? ??? ? 0x00007f9d580f569b: movq 0x18(%r13, %r11, 8), %r8;*laload {reexecute=0 rethrow=0 return_oop=0}
?? ??? ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 22 (line 1054)
?? ??? ?
; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
0.23% ?? ??? ? 0x00007f9d580f56a0: cmpq %r8, %rsi
??? ??? ? 0x00007f9d580f56a3: jl 0x7f9d580f5713 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
??? ??? ? ; - java.lang.Math::max at 11 (line 2037)
??? ??? ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 28 (line 1055)
??? ??? ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)

The patched version uses conditional moves instead of vectorized instructions:

2.76% ?? 0x00007fcd180f695c: movq 0x18(%r14, %r11, 8), %rdi;*laload {reexecute=0 rethrow=0 return_oop=0}
?? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 22 (line 1054)
?? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
?? 0x00007fcd180f6961: cmpq %rdi, %r13
3.11% ?? 0x00007fcd180f6964: cmovlq %rdi, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
?? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 28 (line 1055)
?? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)

Why are vectorized instructions not kicking in with the patch?
Because superword doesn't think it's profitable to vectorize this: PackSet::print: 2 packs Pack: 0 0: 733 LoadL === 721 184 734 [[ 732 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 1: 728 LoadL === 721 184 729 [[ 727 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 2: 669 LoadL === 721 184 670 [[ 668 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 3: 500 LoadL === 721 184 317 [[ 320 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) Pack: 1 0: 732 MaxL === _ 743 733 [[ 727 ]] !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 1: 727 MaxL === _ 732 728 [[ 668 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 2: 668 MaxL === _ 727 669 [[ 320 ]] !orig=320,685 !jvms: 
VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 3: 320 MaxL === _ 668 500 [[ 743 593 456 ]] !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) WARNING: Removed pack: not profitable: 0: 732 MaxL === _ 743 733 [[ 727 ]] !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 1: 727 MaxL === _ 732 728 [[ 668 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 2: 668 MaxL === _ 727 669 [[ 320 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 3: 320 MaxL === _ 668 500 [[ 743 593 456 ]] !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) WARNING: Removed pack: not profitable: 0: 733 LoadL === 721 184 734 [[ 732 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 1: 728 LoadL === 721 184 729 [[ 727 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 
2: 669 LoadL === 721 184 670 [[ 668 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 500 LoadL === 721 184 317 [[ 320 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
After Superword::filter_packs_for_profitable
PackSet::print: 0 packs
SuperWord::transform_loop failed: SuperWord::SLP_extract did not vectorize

How can you make it vectorize? By doing something with the value from the array before passing it to min/max. That is what the `MinMaxVector.longReduction[Min|Max]` and `VectorReduction2.long[Min|Max]DotProduct` methods do.

# `VectorReduction2.NoSuperword` on AVX-512 machine

Benchmark (SIZE) (seed) Mode Cnt Baseline Patch Units
VectorReduction2.NoSuperword.longMaxBig 2048 0 avgt 3 3964.403 2966.258 ns/op
VectorReduction2.NoSuperword.longMaxDotProduct 2048 0 avgt 3 1686.373 2462.876 ns/op
VectorReduction2.NoSuperword.longMaxSimple 2048 0 avgt 3 722.219 2171.859 ns/op
VectorReduction2.NoSuperword.longMinBig 2048 0 avgt 3 3994.685 2971.143 ns/op
VectorReduction2.NoSuperword.longMinDotProduct 2048 0 avgt 3 1366.291 2428.173 ns/op
VectorReduction2.NoSuperword.longMinSimple 2048 0 avgt 3 719.218 2179.546 ns/op

Performance improves for `long[Min|Max]Big`. `long[Min|Max]Simple` suffers the same issue as shown in the previous section: when not vectorized, these benchmarks fall back on conditional moves. The drop in performance in `long[Min|Max]DotProduct` needs some explanation.
### `long[Min|Max]DotProduct` performance drops considerably Baseline uses compare + move instructions here: 5.67% ??? ???? ? 0x00007f3fcc0fa71d: movq 0x20(%r14, %r8, 8), %r9 5.19% ??? ???? ? 0x00007f3fcc0fa722: imulq 0x20(%rax, %r8, 8), %r9;*lmul {reexecute=0 rethrow=0 return_oop=0} ??? ???? ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 30 (line 1125) ??? ???? ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190) 8.46% ??? ???? ? 0x00007f3fcc0fa728: cmpq %r9, %rsi ???????? ? 0x00007f3fcc0fa72b: jl 0x7f3fcc0fa751 ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ???????? ? ; - java.lang.Math::max at 11 (line 2037) ???????? ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 36 (line 1126) ???????? ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190) Patch transforms this into conditional moves: 11.00% ? 0x00007f66f40f70b2: movq 0x18(%r13, %rcx, 8), %rax ? 0x00007f66f40f70b7: imulq 0x18(%r9, %rcx, 8), %rax;*lmul {reexecute=0 rethrow=0 return_oop=0} ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 30 (line 1125) ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190) ? 0x00007f66f40f70bd: cmpq %rdx, %rax 13.07% ? 0x00007f66f40f70c0: cmovlq %rdx, %rax ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 36 (line 1126) ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190) This is similar to what we have seen above. Lacking superword functionality, the fallback for MaxL/MinL implies using conditional moves. 
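For reference, the reduction kernels being compared in these sections have roughly the following shape (my paraphrase of the JMH benchmarks, not their actual source — class and method names here are made up):

```java
// Paraphrased shapes of the benchmark kernels (hypothetical class, not the
// real JMH sources). With the patch, SuperWord deems a bare MaxL reduction
// over a load not profitable (it falls back to cmov), while the variant
// that multiplies before reducing does get vectorized.
public class ReductionShapes {
    // "simple" shape: load feeds straight into Math.max
    static long maxSimple(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            acc = Math.max(acc, v);            // reduction over MaxL alone
        }
        return acc;
    }

    // "dot product" shape: extra vector work feeds the reduction
    static long maxDotProduct(long[] a, long[] b) {
        long acc = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            acc = Math.max(acc, a[i] * b[i]);  // MulL pack + MaxL pack
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] a = {3, -7, 11, 2};
        long[] b = {1, 1, 1, 1};
        System.out.println(maxSimple(a));        // prints 11
        System.out.println(maxDotProduct(a, b)); // prints 11
    }
}
```

The difference matters because, per the traces above, the `MulL` pack gives SuperWord enough vector work to keep the `MaxL` pack profitable, whereas a load feeding straight into `MaxL` does not.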
Although branch probabilities are not controlled here, we can observe that one of the branches is likely being taken ~100% of the time. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2642788364 From coleenp at openjdk.org Fri Feb 7 12:34:40 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 7 Feb 2025 12:34:40 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Fix jvmci test. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/304a17ee..37a8cf81 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=05-06 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From galder at openjdk.org Fri Feb 7 12:39:24 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Fri, 7 Feb 2025 12:39:24 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: > This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. > > Currently vectorization does not kick in for loops containing either of these calls because of the following error: > > > VLoop::check_preconditions: failed: control flow in loop not allowed > > > The control flow is due to the Java implementation of these methods, e.g. > > > public static long max(long a, long b) { > return (a >= b) ? a : b; > } > > > This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes, respectively. > By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. > E.g.
> > > SuperWord::transform_loop: > Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined > 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) > > > Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1155 > long max 1173 > > > After the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1042 > long max 1042 > > > This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes. > Therefore, it still relies on macro expansion to transform those into CMoveL. > > I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier1 2500 2500 0 0 >>> jtreg:test/jdk:tier1 ... Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 44 additional commits since the last revision:

- Merge branch 'master' into topic.intrinsify-max-min-long
- Fix typo
- Renaming methods and variables and add docu on algorithms
- Fix copyright years
- Make sure it runs with cpus with either avx512 or asimd
- Test can only run with 256 bit registers or bigger
  * Remove platform dependant check and use platform independent configuration instead.
- Fix license header
- Tests should also run on aarch64 asimd=true envs
- Added comment around the assertions
- Adjust min/max identity IR test expectations after changes
- ... and 34 more: https://git.openjdk.org/jdk/compare/f56622ff...a190ae68

------------- Changes: - all: https://git.openjdk.org/jdk/pull/20098/files - new: https://git.openjdk.org/jdk/pull/20098/files/724a346a..a190ae68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=10-11 Stats: 206462 lines in 5108 files changed: 101636 ins; 84099 del; 20727 mod Patch: https://git.openjdk.org/jdk/pull/20098.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20098/head:pull/20098 PR: https://git.openjdk.org/jdk/pull/20098 From epeter at openjdk.org Fri Feb 7 16:40:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 16:40:15 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> Message-ID: On Thu, 6 Feb 2025 19:12:26 GMT, Quan Anh Mai wrote: >> Looks good, thanks for the explanations! >> >> I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok.
>> >> But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) > > @eme64 I have merged the change with master, could you help me initiate the testing process, please? Thanks very much. @merykitty Testing is all passing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2643431552 From coleenp at openjdk.org Fri Feb 7 19:16:13 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 7 Feb 2025 19:16:13 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. I added some code to hide the Class.modifiers field and fixed the JVMCI test. Please re-review. 
Also @iwanowww I think the intrinsic for isInterface can be removed and just be Java code like: public boolean isInterface() { return Modifier.isInterface(getModifiers()); } ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2643799984 From vlivanov at openjdk.org Fri Feb 7 19:47:12 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 7 Feb 2025 19:47:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. Marked as reviewed by vlivanov (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2602686659 From never at openjdk.org Fri Feb 7 20:01:24 2025 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 7 Feb 2025 20:01:24 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v3] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 20:56:53 GMT, Tom Rodriguez wrote: >> This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. > > Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: > > improve comments Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23444#issuecomment-2643985603 From vlivanov at openjdk.org Fri Feb 7 20:01:27 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 7 Feb 2025 20:01:27 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 19:13:07 GMT, Coleen Phillimore wrote: > I think the intrinsic for isInterface can be removed Good point. Moreover, it seems most of intrinsics on Class queries can be replaced with a flag bit check on the mirror. (Do we have 16 unused bits in Class::modifiers after this change?) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2643997479 From never at openjdk.org Fri Feb 7 20:01:25 2025 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 7 Feb 2025 20:01:25 GMT Subject: Integrated: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:31:50 GMT, Tom Rodriguez wrote: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. This pull request has now been integrated. 
Changeset: 7f6c6878 Author: Tom Rodriguez URL: https://git.openjdk.org/jdk/commit/7f6c687815031d99931265007ff8867bf964cb25 Stats: 14 lines in 1 file changed: 9 ins; 0 del; 5 mod 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash Reviewed-by: kvn, dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/23444 From coleenp at openjdk.org Fri Feb 7 21:14:13 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 7 Feb 2025 21:14:13 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 19:58:12 GMT, Vladimir Ivanov wrote: > Good point. Moreover, it seems most of intrinsics on Class queries can be replaced with a flag bit check on the mirror. (Do we have 16 unused bits in Class::modifiers after this change?) Yes, I think so. isArray and isPrimitive definitely. We could first change the modifiers field to "char" because that's its size and then have two booleans for each of these. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2644136904 From liach at openjdk.org Fri Feb 7 21:37:12 2025 From: liach at openjdk.org (Chen Liang) Date: Fri, 7 Feb 2025 21:37:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. 
I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. Making `isArray` and `isPrimitive` Java-based is going to be helpful for the interpreter performance of these methods in early bootstrap. ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2644171713 From qamai at openjdk.org Sat Feb 8 04:23:18 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 8 Feb 2025 04:23:18 GMT Subject: Integrated: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode In-Reply-To: References: Message-ID: On Thu, 23 Jan 2025 17:22:02 GMT, Quan Anh Mai wrote: > Hi, > > This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: > > // We are allowed to use the constant type only if cast succeeded > > But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. > > Please take a look and leave your reviews, thanks a lot. This pull request has now been integrated. 
Changeset: e9278de3 Author: Quan Anh Mai URL: https://git.openjdk.org/jdk/commit/e9278de3f8676c288bfdce96f8348470e7c42900 Stats: 60 lines in 10 files changed: 5 ins; 18 del; 37 mod 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode Reviewed-by: vlivanov, epeter ------------- PR: https://git.openjdk.org/jdk/pull/23274 From qamai at openjdk.org Sat Feb 8 04:23:17 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 8 Feb 2025 04:23:17 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 19:11:58 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into loadklassctrl > - format > - clearer intention, revert formatting, add assert > - remove always_see_exact_class > - remove control input of LoadKlassNode Thanks a lot for your reviews and testing! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2644491331 From alanb at openjdk.org Sat Feb 8 19:44:12 2025 From: alanb at openjdk.org (Alan Bateman) Date: Sat, 8 Feb 2025 19:44:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. No more comments from me. ------------- Marked as reviewed by alanb (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2604014387 From kvn at openjdk.org Sun Feb 9 19:43:29 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 9 Feb 2025 19:43:29 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Zero and Minimal VM builds ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/11abd5e7..dda20f0b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=00-01 Stats: 6 lines in 1 file changed: 4 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Sun Feb 9 19:43:29 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 9 Feb 2025 19:43:29 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: 
<0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 17:45:30 GMT, Vladimir Kozlov wrote: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp @dougxc and @tkrodriguez, please look if it affects Graal. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2646553512 From cjplummer at openjdk.org Mon Feb 10 03:14:22 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Mon, 10 Feb 2025 03:14:22 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds I almost wished I hadn't looked because there is a lot of SA CodeBlob support that could use some cleanup. Most notably I think most of the wrapper subclasses are not needed by SA, and could be served by one common class. 
See what I'm doing in #23456 for JavaThread subclasses. Wrapper classes don't need to be 1-to-1 with the class type they are wrapping. A single wrapper class type can handle any number of hotspot types. It usually just makes more sense for them to be 1-to-1, but when they are trivial and the implementation is replicated across subtypes, just having one wrapper class implement them all can simplify things. The other thing I noticed is that a lot of the subtypes seem to be doing some unnecessary things like registering Observers, and they all do something like the following:

    44         Type type = db.lookupType("BufferBlob");

even when "type" is never referenced. I'm not suggesting you clean up any of this now, just pointing it out. I might file an issue and try to clean it up myself at some point. I still need to take a closer look at the SA changes.

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 38:

> 36: public class CodeCache {
> 37: private static GrowableArray<CodeHeap> heapArray;
> 38: private static VirtualConstructor virtualConstructor;

What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes.
------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2604594200 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1948335278 From cjplummer at openjdk.org Mon Feb 10 03:29:13 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Mon, 10 Feb 2025 03:29:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 02:47:58 GMT, Chris Plummer wrote:

>> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Fix Zero and Minimal VM builds

> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 38:
>
>> 36: public class CodeCache {
>> 37: private static GrowableArray<CodeHeap> heapArray;
>> 38: private static VirtualConstructor virtualConstructor;
>
> What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes.

I think I found the answer. Since there is no longer a vtable, TypeDataBase.addressTypeIsEqualToType() will no longer work for CodeBlobs. I was wondering if the lack of a vtable might have some negative impact. Glad to see you found a solution. I hope the lack of a vtable does not creep in elsewhere. I know it will have some negative impact on things like the "findpc" functionality, which will no longer be able to tell the user that an address points to a CodeBlob instance. There's no test for this, but users might run across it.
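Chris's point above is the crux: with no vtable, the SA cannot match an address's dynamic type and has to key off the `CodeBlob::_kind` byte instead. A hypothetical sketch of what a `getClassFor()`-style lookup amounts to — the class names and kind numbering here are invented for illustration, not the actual SA code:

```java
// Stand-ins for the SA wrapper classes (names hypothetical).
class CodeBlobSketch {}
class NMethodSketch extends CodeBlobSketch {}
class BufferBlobSketch extends CodeBlobSketch {}
class RuntimeStubSketch extends CodeBlobSketch {}

class CodeCacheSketch {
    // Assumed values of the CodeBlob::_kind byte read out of the target VM.
    static final int KIND_NMETHOD      = 1;
    static final int KIND_BUFFER_BLOB  = 2;
    static final int KIND_RUNTIME_STUB = 3;

    // Pick the wrapper class from the kind byte, instead of comparing the
    // blob's (now absent) vtable pointer against known vtable addresses.
    static Class<? extends CodeBlobSketch> getClassFor(int kind) {
        switch (kind) {
            case KIND_NMETHOD:      return NMethodSketch.class;
            case KIND_BUFFER_BLOB:  return BufferBlobSketch.class;
            case KIND_RUNTIME_STUB: return RuntimeStubSketch.class;
            default:                return CodeBlobSketch.class;
        }
    }
}
```

The trade-off Chris notes follows directly: a kind byte identifies a blob only if you already know the address is a blob, whereas vtable matching let tools like "findpc" recognize one from a raw address.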
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1948352958 From jbhateja at openjdk.org Mon Feb 10 05:33:25 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Feb 2025 05:33:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. 
>> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos Hi @PaulSandoz , Kindly let us know if this is good for integration. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2646957788 From galder at openjdk.org Mon Feb 10 09:29:20 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 10 Feb 2025 09:29:20 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Fri, 7 Feb 2025 12:27:42 GMT, Galder Zamarreño wrote: > At 100% probability baseline fails to vectorize because it observes a control flow. This control flow is not the one you see in min/max implementations, but this is one added by HotSpot as a result of the JIT profiling. It observes that one branch is always taken so it optimizes for that, and adds a branch for the uncommon case where the branch is not taken. I've dug further into this to try to understand how the baseline hotspot code works, and the explanation above is not entirely correct. Let's look at the IR differences between, say, 100% vs 80% branch situations.
At branch 80% you see: 1115 CountedLoop === 1115 598 463 [[ 1101 1115 1116 1118 451 594 ]] inner stride: 2 main of N1115 strip mined !orig=[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 692 LoadL === 1083 1101 393 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 651 LoadL === 1095 1101 355 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 747 MaxL === _ 651 692 [[ 451 ]] !orig=[608],[416] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 451 StoreL === 1115 1101 449 747 [[ 1116 454 911 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=9; !orig=1124 !jvms: MinMaxVector::longLoopMax @ bci:30 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 594 CountedLoopEnd === 1115 593 [[ 1123 463 ]] [lt] P=0.999731, C=780799.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) You see the counted loop with the LoadL for array loads and MaxL consuming those. The StoreL is for array assignment (I think). 
At branch 100% you see: 650 LoadL === 1105 1119 355 [[ 416 408 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 691 LoadL === 1093 1119 393 [[ 416 408 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 408 CmpL === _ 650 691 [[ 409 ]] !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 409 Bool === _ 408 [[ 410 ]] [lt] !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 410 If === 1132 409 [[ 411 412 ]] P=0.019892, C=79127.000000 !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 411 IfTrue === 410 [[ 415 ]] #1 !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 412 IfFalse === 410 [[ 415 ]] #0 !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 415 Region === 415 411 412 [[ 415 594 416 451 ]] !orig=[423] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 594 CountedLoopEnd === 415 593 [[ 1139 463 ]] [lt] P=0.999683, C=706030.000000 !orig=[462] !jvms: 
MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) You see a region within the counted loop with the if/else which belongs to the actual `Math.max` implementation, with the corresponding CmpL and the LoadL nodes for retrieving the longs from the arrays. What causes the difference? It's this section in `PhaseIdealLoop::conditional_move`: ```c++ // Check for highly predictable branch. No point in CMOV'ing if // we are going to predict accurately all the time. if (C->use_cmove() && (cmp_op == Op_CmpF || cmp_op == Op_CmpD)) { //keep going } else if (iff->_prob < infrequent_prob || iff->_prob > (1.0f - infrequent_prob)) return nullptr; At branch 100 `iff->_prob > (1.0f - infrequent_prob)` becomes true and no CMoveL is created so hotspot seems to stick to the original bytecode implementation of `Math.max`. At branch 80 that comparison is below and CMoveL is created, which eventually gets converted into a MaxL node and vectorization kicks in. The numbers are interesting. `infrequent_prob` appears to be a fixed number `0.181818187` and `1.0f` minus that is `0.818181812`. So, at branch 100 `iff->_prob` is `0.906792104` therefore higher than `0.818181812`, and at branch 80 `0.718619287`. I would have expected those `iff->_prob` to be closer to the branch % targets I set, but ignoring that, seems like ~90% would be the cut off. 
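The gating arithmetic is easy to reproduce with the constants reported in this thread (`infrequent_prob` ≈ 0.181818187, and the measured `iff->_prob` values of 0.906792104 and 0.718619287). A sketch of the non-float path — the `use_cmove()` escape hatch doesn't apply here, since the comparison is a `CmpL`, not `CmpF`/`CmpD`:

```java
// Sketch of the cmov gating check quoted above. The constants are the
// values reported in this thread, not re-derived from the HotSpot sources.
class CmovGateSketch {
    static final double INFREQUENT_PROB = 0.181818187;

    // true  -> conditional_move() keeps going (a CMove can be created)
    // false -> the branch is considered too predictable; bail out
    static boolean cmovAllowed(double prob) {
        return prob >= INFREQUENT_PROB && prob <= 1.0 - INFREQUENT_PROB;
    }
}
```

With these numbers, the 100% case (prob 0.9068) falls outside the [0.1818, 0.8182] window and keeps the branchy code, while the 80% case (prob 0.7186) falls inside it and gets the CMoveL that later becomes MaxL.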
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2647410266 From yzheng at openjdk.org Mon Feb 10 10:17:15 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Mon, 10 Feb 2025 10:17:15 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. JVMCI change looks good to me ------------- Marked as reviewed by yzheng (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2605295926 From dnsimon at openjdk.org Mon Feb 10 11:03:13 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 10 Feb 2025 11:03:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:36:28 GMT, Vladimir Kozlov wrote: > @dougxc and @tkrodriguez, please look if it affects Graal. I'm pretty sure JVMCI does not care about the virtual-ness of these C++ classes. Running tier9 in mach5 is a good way to be sure. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2647642674 From adinn at openjdk.org Mon Feb 10 11:07:14 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 10 Feb 2025 11:07:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <0LQ3b0zaCg8HEDx4C5xM8W4-qmQ9PkoAClhyVxKxxtE=.8cd94c7a-8496-436c-8387-6aa443942bb6@github.com> On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds src/hotspot/share/code/codeBlob.cpp line 58: > 56: #include > 57: > 58: // Virtual methods are not allowed in code blobs to simplify caching compiled code. 
Is it worth considering generating this code plus also some of the existing code in the header using an iterator template macro? e.g.

    #define CODEBLOBS_DO(do_codeblob_abstract, do_codeblob_nonleaf, \
                         do_codeblob_leaf) \
      do_codeblob_abstract(CodeBlob) \
      do_codeblob_leaf(nmethod, Nmethod, nmethod) \
      do_codeblob_abstract(RuntimeBlob) \
      do_codeblob_nonleaf(BufferBlob, Buffer, buffer) \
      do_codeblob_leaf(AdapterBlob, Adapter, adapter) \
      . . . \
      do_codeblob_leaf(RuntimeStub, Runtime_Stub, runtime_stub) \
      . . .

The macro arguments to the templates would themselves be macros:

    do_codeblob_abstract(classname)                        // abstract, non-instantiable class
    do_codeblob_nonleaf(classname, kindname, accessorname) // instantiable, subclassable
    do_codeblob_leaf(classname, kindname, accessorname)    // instantiable, non-subclassable

Using a template macro like this to generate the code below -- *plus also* some of the code currently declared piecemeal in the header -- would guarantee all cases are covered now and will remain so later when the macro is updated. I think it would probably also allow case handling code in AOT cache code to be generated.

So, we would generate the code here as follows:

    #define empty1(classname)
    #define empty3(classname, kindname, accessorname)

    #define assert_nonvirtual_leaf(classname, kindname, accessorname) \
      static_assert(!std::is_polymorphic<classname>::value, \
                    "no virtual methods are allowed in " # classname );

    CODEBLOBS_DO(empty1, empty3, assert_nonvirtual_leaf)

    #undef assert_nonvirtual_leaf

Likewise in codeBlob.hpp we could generate `enum CodeBlobKind` to cover all the non-abstract classes and likewise generate the accessor methods `is_nmethod()`, `is_buffer_blob()` in class `CodeBlob` which allow the kind to be tested.

    #define codekind_enum_tag(classname, kindname, accessorname) \
      kindname,

    enum CodeBlobKind : u1 {
      None,
      CODEBLOBS_DO(empty1, codekind_enum_tag, codekind_enum_tag)
      Number_Of_Kinds
    };

    #define is_codeblob_define(classname, kindname, accessorname) \
      bool is_##accessorname() { return _kind == kindname; }

    class CodeBlob {
      . . .
      CODEBLOBS_DO(empty1, is_codeblob_define, is_codeblob_define);
      . . .

There may be other opportunities to use the iterator (e.g. in vmStructs.cpp?) but this looks like a good start.
Thank you for the reviews Yudi, Alan, Chen, Vladimir and Dean, and the help and comments with the various pieces of this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2647880184 From coleenp at openjdk.org Mon Feb 10 12:47:32 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 10 Feb 2025 12:47:32 GMT Subject: Integrated: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <-VYQTxGucpCCQZccdw6wMnDavFDAt75MDHY8mGxEMiw=.042099b8-41dc-4b0d-8bdd-a874f004a0f6@github.com> On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. This pull request has now been integrated. 
Changeset: c9cadbd2 Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/c9cadbd23fb13933b8968f283d27842cd35f8d6f Stats: 217 lines in 31 files changed: 71 ins; 127 del; 19 mod 8346567: Make Class.getModifiers() non-native Reviewed-by: alanb, vlivanov, yzheng, dlong ------------- PR: https://git.openjdk.org/jdk/pull/22652 From stefank at openjdk.org Mon Feb 10 16:26:12 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Mon, 10 Feb 2025 16:26:12 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds We have a similar situation with oopDesc, which is not allowed to have a vtable. The solution there is to use the Klass as the proxy vtable and then have a bunch of Klass::oop_ functions that act like virtual dispatch functions for associated oopDesc functions. I wonder if a similar approach can be used here? Such an approach would (to me at least) have the benefit that we don't have to spread switch statements in various functions in the top-most class.
If you are interested in seeing a prototype of this, take a look at this branch: https://github.com/openjdk/jdk/compare/master...stefank:jdk:code_blob_vptr Just a suggestion if you want to consider alternatives to these switch statements. ------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2606457754 From kvn at openjdk.org Mon Feb 10 16:39:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Feb 2025 16:39:19 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: <0LQ3b0zaCg8HEDx4C5xM8W4-qmQ9PkoAClhyVxKxxtE=.8cd94c7a-8496-436c-8387-6aa443942bb6@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <0LQ3b0zaCg8HEDx4C5xM8W4-qmQ9PkoAClhyVxKxxtE=.8cd94c7a-8496-436c-8387-6aa443942bb6@github.com> Message-ID: <1P7Q-yHC0Ho8DPfgzZfxR27NmNQPJ4LcgEbilqdaVNw=.0c023c74-b3d9-4139-8363-5ebdf1a1805d@github.com> On Mon, 10 Feb 2025 11:04:38 GMT, Andrew Dinn wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > src/hotspot/share/code/codeBlob.cpp line 58: > >> 56: #include >> 57: >> 58: // Virtual methods are not allowed in code blobs to simplify caching compiled code. > > Is it worth considering generating this code plus also some of the existing code in the header using an iterator template macro? e.g. > > #define CODEBLOBS_DO(do_codeblob_abstract, do_codeblob_nonleaf, \ > do_codeblob_leaf) \ > do_codeblob_abstract(CodeBlob) \ > do_codeblob_leaf(nmethod, Nmethod, nmethod) \ > do_codeblob_abstract(RuntimeBlob) \ > do_codeblob_nonleaf(BufferBlob, Buffer, buffer) \ > do_codeblob_leaf(AdapterBlob, Adapter, adapter) \ > . . . \ > do_codeblob_leaf(RuntimeStub, Runtime_Stub, runtime_stub) \ > . . . 
> > The macro arguments to the templates would themselves be macros: > > do_codeblob_abstract(classname) // abstract, non-instantiable class > do_codeblob_nonleaf(classname, kindname, accessorname) // instantiable, subclassable > do_codeblob_leaf(classname, kindname, accessorname) // instantiable, non-subclassable > > Using a template macro like this to generate the code below -- *plus also* some of the code currently declared piecemeal in the header -- would guarantee all cases are covered now and will remain so later so when the macro is updated. I think it would probably also allow case handling code in AOT cache code to be generated. > > So, we would generate the code here as follows > > #define EMPTY1(classname) > #define EMPTY3(classname, kindname, accessorname) > > #define assert_nonvirtual_leaf(classname, kindname, accessorname) \ > static_assert(!std::is_polymorphic::value, \ > "no virtual methods are allowed in " # classname ); > > CODEBLOBS_DO(empty1, empty3, assert_nonvirtual_leaf) > > #undef assert_nonvirtual_leaf > > Likewise in codeBlob.hpp we could generate `enum CodeBlobKind` to cover all the non-abstract classes and likewise generate the accessor methods `is_nmethod()`, `is_buffer_blob()` in class `CodeBlob` which allow the kind to be tested. > > #define codekind_enum_tag(classname, kindname, accessorname) \ > kindname, > > enum CodeBlobKind : u1 { > None, > CODEBLOBS_DO(empty1, codekind_enum_tag, codekind_enum_tag) > Number_Of_Kinds > }; > > ... Thank you @adinn for suggestion but no, I don't like macros - hard to debug and they add more complexity in this case. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1949483501 From kvn at openjdk.org Mon Feb 10 16:50:12 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Feb 2025 16:50:12 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 03:25:30 GMT, Chris Plummer wrote: >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 38: >> >>> 36: public class CodeCache { >>> 37: private static GrowableArray heapArray; >>> 38: private static VirtualConstructor virtualConstructor; >> >> What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. > > I think I found the answer. Since there is no longer a vtable, TypeDataBase.addressTypeIsEqualToType() will no longer work for Codeblobs. I was wondering if the lack of a vtable might have some negative impact. Glad to see you found a solution. I hope the lack of a vtable does not creep in elsewhere. I know it will have some negative impact on things like the "findpc" functionality, which will no longer be able to tell the user that an address points to a Codeblob instance. There's no test for this, but users might run across it. > What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. I don't need any more mapping from CodeBlob class to corresponding virtual table name which does not exist anymore. `CodeBlob::_kind` field's value is used to determine which class should be used. I think `hashMap` is overkill here. I can construct array `Class cbClasses[]` and use `cbClasses[CodeBlob::_kind]` instead of `if/else` in `getClassFor`. 
But I would still need to check for an unknown value of `CodeBlob::_kind` somehow. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1949505126 From kvn at openjdk.org Mon Feb 10 17:06:13 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Feb 2025 17:06:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 16:23:53 GMT, Stefan Karlsson wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > We have a similar situation with oopDesc that are not allowed to have a vtable. The solution there is to use the Klass as the proxy vtable and then have a bunch of Klass::oop_ functions that act like virtual dispatch functions for associated oopDesc functions. > > I wonder if a similar approach can be used here? Such an approach would (to me at least) have the benefit that we don't have to spread switch statements in various functions in the top-most class. > > If you are interested in seeing a prototype of this, take a look at this branch: > https://github.com/openjdk/jdk/compare/master...stefank:jdk:code_blob_vptr > > Just a suggestion if you want to consider alternatives to these switch statements. Thank you, @stefank. This is a very interesting suggestion, which I may take. I will check it.
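For comparison with the switch/array alternatives discussed in this thread, here is a minimal sketch of the kind-indexed class lookup Vladimir describes for the SA side. The class names, the kind numbering, and the `getClassFor` shape are illustrative stand-ins, not the real SA sources:

```java
// Hypothetical sketch: replacing vtable-based type detection with a
// lookup indexed by the value of a kind field (like CodeBlob::_kind).
public class CodeBlobFactory {
    // Placeholder stand-ins for the SA wrapper classes.
    static class NMethod {}
    static class BufferBlob {}
    static class RuntimeStub {}

    // Index mirrors a hypothetical CodeBlobKind ordinal; 0 means "none".
    static final Class<?>[] CB_CLASSES = {
        null,             // 0: None
        NMethod.class,    // 1
        BufferBlob.class, // 2
        RuntimeStub.class // 3
    };

    static Class<?> getClassFor(int kind) {
        // The unknown-kind check mentioned above still has to be explicit;
        // the array lookup alone would only catch out-of-range values.
        if (kind <= 0 || kind >= CB_CLASSES.length) {
            throw new IllegalArgumentException("unknown CodeBlob kind: " + kind);
        }
        return CB_CLASSES[kind];
    }
}
```

The array keeps the mapping in one place, which is the main advantage over an if/else chain repeated per call site.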
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2648688942 From mpowers at openjdk.org Mon Feb 10 21:01:18 2025 From: mpowers at openjdk.org (Mark Powers) Date: Mon, 10 Feb 2025 21:01:18 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments + some code reorganization Some measurements:

With Intrinsics
---------------
keygen ML-DSA-44  38.8 us/op
keygen ML-DSA-65  82.5 us/op
keygen ML-DSA-87 112.6 us/op
siggen ML-DSA-44 119.1 us/op
siggen ML-DSA-65 186.5 us/op
siggen ML-DSA-87 306.1 us/op
sigver ML-DSA-44  46.4 us/op
sigver ML-DSA-65  72.8 us/op
sigver ML-DSA-87 123.4 us/op

No Intrinsics
-------------
keygen ML-DSA-44  63.1 us/op
keygen ML-DSA-65 118.7 us/op
keygen ML-DSA-87 167.2 us/op
siggen ML-DSA-44 466.8 us/op
siggen ML-DSA-65 546.3 us/op
siggen ML-DSA-87 560.3 us/op
sigver ML-DSA-44  71.6 us/op
sigver ML-DSA-65 117.9 us/op
sigver ML-DSA-87 180.4 us/op

------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2649220775 From psandoz at openjdk.org Mon Feb 10 21:26:25 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 10 Feb 2025 21:26:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1.
Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos An impressive and substantial change. I focused on the Java code, there are some small tweaks, presented in comments, we can make to the intrinsics to improve the expression of code, and it has no impact on the intrinsic implementation. src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 32: > 30: * The class {@code Float16Math} constains intrinsic entry points corresponding > 31: * to scalar numeric operations defined in Float16 class. 
> 32: * @since 25 You can remove this line, since this is an internal class. src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 38: > 36: } > 37: > 38: public interface Float16UnaryMathOp { You can just use `UnaryOperator`, no need for a new type, here are the updated methods you can apply to this class. @FunctionalInterface public interface TernaryOperator { T apply(T a, T b, T c); } @IntrinsicCandidate public static T sqrt(Class box_class, T oa, UnaryOperator defaultImpl) { assert isNonCapturingLambda(defaultImpl) : defaultImpl; return defaultImpl.apply(oa); } @IntrinsicCandidate public static T fma(Class box_class, T oa, T ob, T oc, TernaryOperator defaultImpl) { assert isNonCapturingLambda(defaultImpl) : defaultImpl; return defaultImpl.apply(oa, ob, oc); } static boolean isNonCapturingLambda(Object o) { return o.getClass().getDeclaredFields().length == 0; } And in `src/hotspot/share/classfile/vmIntrinsics.hpp`: /* Float16Math API intrinsification support */ \ /* Float16 signatures */ \ do_signature(float16_unary_math_op_sig, "(Ljava/lang/Class;" \ "Ljava/lang/Object;" \ "Ljava/util/function/UnaryOperator;)" \ "Ljava/lang/Object;") \ do_signature(float16_ternary_math_op_sig, "(Ljava/lang/Class;" \ "Ljava/lang/Object;" \ "Ljava/lang/Object;" \ "Ljava/lang/Object;" \ "Ljdk/internal/vm/vector/Float16Math$TernaryOperator;)" \ "Ljava/lang/Object;") \ do_intrinsic(_sqrt_float16, jdk_internal_vm_vector_Float16Math, sqrt_name, float16_unary_math_op_sig, F_S) \ do_intrinsic(_fma_float16, jdk_internal_vm_vector_Float16Math, fma_name, float16_ternary_math_op_sig, F_S) \ src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java line 1202: > 1200: */ > 1201: public static Float16 sqrt(Float16 radicand) { > 1202: return (Float16) Float16Math.sqrt(Float16.class, radicand, With changes to the intrinsics (as presented in another comment) you no longer need explicit casts and the code is precisely the same as before except embedded in a lambda 
body: public static Float16 sqrt(Float16 radicand) { return Float16Math.sqrt(Float16.class, radicand, (_radicand) -> { // Rounding path of sqrt(Float16 -> double) -> Float16 is fine // for preserving the correct final value. The conversion // Float16 -> double preserves the exact numerical value. The // conversion of double -> Float16 also benefits from the // 2p+2 property of IEEE 754 arithmetic. return valueOf(Math.sqrt(_radicand.doubleValue())); } ); } Similarly for `fma`: return Float16Math.fma(Float16.class, a, b, c, (_a, _b, _c) -> { // product is numerically exact in float before the cast to // double; not necessary to widen to double before the // multiply. double product = (double)(_a.floatValue() * _b.floatValue()); return valueOf(product + _c.doubleValue()); }); test/jdk/jdk/incubator/vector/ScalarFloat16OperationsTest.java line 44: > 42: import static jdk.incubator.vector.Float16.*; > 43: > 44: public class ScalarFloat16OperationsTest { Now that we have IR tests do you still think this test is necessary or should we have more IR test instead? @eme64 thoughts? We could follow up in another PR if need be. 
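The `isNonCapturingLambda` guard sketched in the review above relies on a HotSpot implementation detail: the lambda metafactory only generates instance fields on a lambda's class when the lambda captures values. A standalone illustration (observable on HotSpot, but not guaranteed by the language specification):

```java
import java.util.function.UnaryOperator;

public class LambdaCaptureDemo {
    // Same shape as the check suggested in the review: captured values
    // appear as synthetic instance fields on the generated lambda class.
    static boolean isNonCapturingLambda(Object o) {
        return o.getClass().getDeclaredFields().length == 0;
    }

    public static void main(String[] args) {
        UnaryOperator<Integer> inc = x -> x + 1;        // captures nothing
        int step = 5;
        UnaryOperator<Integer> addStep = x -> x + step; // captures `step`
        System.out.println(isNonCapturingLambda(inc));     // true on HotSpot
        System.out.println(isNonCapturingLambda(addStep)); // false on HotSpot
    }
}
```

This is why the intrinsic entry points assert non-capturing lambdas: a capturing lambda would smuggle state past the intrinsic's fixed argument list.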
------------- PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2607094727 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949842011 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949871647 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949847574 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949858554 From jbhateja at openjdk.org Tue Feb 11 06:32:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Feb 2025 06:32:56 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: > Hi All, > > This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) > > Following is the summary of changes included with this patch:- > > 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. > 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. > 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. > - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. > 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. > 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. > 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. > 8. 
New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF > 9. X86 backend implementation for all supported intrinsics. > 10. Functional and Performance validation tests. > > Kindly review the patch and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22754/files - new: https://git.openjdk.org/jdk/pull/22754/files/82a42213..111c8084 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=16-17 Stats: 38 lines in 3 files changed: 2 ins; 11 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/22754.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22754/head:pull/22754 PR: https://git.openjdk.org/jdk/pull/22754 From jbhateja at openjdk.org Tue Feb 11 06:32:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Feb 2025 06:32:56 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Mon, 10 Feb 2025 20:43:19 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing typos > > test/jdk/jdk/incubator/vector/ScalarFloat16OperationsTest.java line 44: > >> 42: import static jdk.incubator.vector.Float16.*; >> 43: >> 44: public class ScalarFloat16OperationsTest { > > Now that we have IR tests do you still think this test is necessary or should we have more IR test instead? @eme64 thoughts? We could follow up in another PR if need be. 
Hi Paul, DataProviders used in this Functional validation test exercise each newly added Float16 operation over the entire value range, while our IR tests are more directed towards validating the newly added IR transforms and constant folding scenarios. We have a follow-up PR for auto-vectorizing Float16 operations which can be used to beef up any validation gap. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1950290083 From bkilambi at openjdk.org Tue Feb 11 10:43:22 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 11 Feb 2025 10:43:22 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments + some code reorganization src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2618: > 2616: INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S > 2617: INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S > 2618: INSN(sqdmulh,0, 0b101101, false); // accepted arrangements: T4H, T8H, T2S, T4S Hi, not a comment on the algorithm itself but you might have to add these new instructions in the gtest for aarch64 here - test/hotspot/gtest/aarch64/aarch64-asmtest.py and use this file to generate test/hotspot/gtest/aarch64/asmtest.out.h which would contain these newly added instructions.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1950610623 From kvn at openjdk.org Tue Feb 11 23:58:14 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Feb 2025 23:58:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 03:11:22 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > I almost wished I hadn't looked because there is a lot of SA CodeBlob support that could use some cleanup. Most notably I think most of the wrapper subclasses are not needed by SA, and could be served by one common class. See what I'm doing in #23456 for JavaThread subclasses. Wrapper classes don't need to be 1-to-1 with the class type they are wrapping. A single wrapper class type can handle any number of hotspot types. It usually just make more sense for them to be 1-to-1, but when they are trivial and the implementation is replicated across subtypes, just having one wrapper class implement them all can simplify things. > > The other thing I noticed is a lot of the subtypes seem to be doing some unnecessary things like registering Observers, and they all do something like the following: > > 44 Type type = db.lookupType("BufferBlob"); > > Even when it never references "type". > > I'm not suggesting you clean up any of this now, but just pointed it out. I might file an issue and try to clean it up myself at some point. > > I still need to take a closer look at the SA changes. Before I forgot to answer you, @plummercj I completely agree with your comment about cleaning up wrapper subclasses which do nothing. I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. 
Why not use `getName()` for this purpose without the big `if/else` there? Another purpose could be a placeholder for additional information in the future, which never came. Other wrappers provide information available in `CodeBlob`, like `RuntimeStub.callerMustGCArguments()`. The `_caller_must_gc_arguments` field has been part of the VM's `CodeBlob` class for some time now. Looks like I missed the change in SA when I did the change in the VM. So yes, feel free to clean this up. I will help with review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652321179 From kvn at openjdk.org Wed Feb 12 00:11:28 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:11:28 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v3] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob.
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Add CodeBlob proxy vtable ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/dda20f0b..43ae0ed2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=01-02 Stats: 322 lines in 13 files changed: 175 ins; 90 del; 57 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Wed Feb 12 00:11:28 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:11:28 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds I adopted Stefan's suggestion. I agree that it is more "future-proof". I also remove underscore `_` from `CodeBlobKind` names. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652333587 PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652335723 From kvn at openjdk.org Wed Feb 12 00:14:31 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:14:31 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v4] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Minimal and Zero VM builds again ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/43ae0ed2..7d3dce0e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=02-03 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Wed Feb 12 00:22:31 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:22:31 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v5] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> 
References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Minimal and Zero VM builds once more ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/7d3dce0e..1d108349 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=03-04 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From cjplummer at openjdk.org Wed Feb 12 03:06:15 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Wed, 12 Feb 2025 03:06:15 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Tue, 11 Feb 2025 23:55:46 GMT, Vladimir Kozlov wrote: > I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. Why not use `getName()` for this purpose without big `if/else` there? Possibly getName() didn't exist when PStack was first written. 
It would be good if PStack not only included the type name as it does now, but also the actual name of the blob, which getName() would return. > An other purpose could be a place holder for additional information in a future which never come. Yes, and you also see that with the Observer registration and the `Type type = db.lookupType()` code, which are only needed if you are going to lookup fields of the subtypes, which most don't ever do, yet they all have this code. > Other wrapper provides information available in `CodeBlob`. Like `RuntimeStub. callerMustGCArguments()`. `_caller_must_gc_arguments` field is part of VM's `CodeBlob` class for some time now. Looks like I missed change in SA when did change in VM. Yeah, that's not working right for CodeBlob subtypes that are not RuntimeStubs. Easy to fix though. > So yes, feel free to clean this up. I will help with review. Ok. Let me see where things are at after you are done with the PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652549878 From jbhateja at openjdk.org Wed Feb 12 09:13:17 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 09:13:17 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: <1xQeG8IO8aJNUluyWTaz9cm2xmTKSNsZJMNhnicnm5s=.304de8b6-9bba-44db-9982-eddaf950a415@github.com> On Mon, 10 Feb 2025 21:23:28 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing typos > > An impressive and substantial change. I focused on the Java code, there are some small tweaks, presented in comments, we can make to the intrinsics to improve the expression of code, and it has no impact on the intrinsic implementation. Hi @PaulSandoz , Your comments have been addressed. 
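On the Float16 storage-type point (item 7 of the PR summary): the raw-shorts representation mirrors the binary16 conversions that have been available in java.lang.Float since JDK 20, independent of the incubating Float16 class. A quick illustration of how values round when squeezed into 16 bits:

```java
public class Binary16Demo {
    public static void main(String[] args) {
        // A binary16 value travels as a raw short; 1.5f is exactly
        // representable, so the round trip is lossless.
        short half = Float.floatToFloat16(1.5f);
        System.out.println(Float.float16ToFloat(half)); // 1.5
        // 0.1f is not representable in binary16 (10 significand bits),
        // so the conversion rounds to the nearest binary16 value.
        float rounded = Float.float16ToFloat(Float.floatToFloat16(0.1f));
        System.out.println(rounded); // ~0.09997559, not 0.1
    }
}
```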
------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2653071755 From psandoz at openjdk.org Wed Feb 12 14:49:27 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Wed, 12 Feb 2025 14:49:27 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 06:32:56 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions Looks good. I merged this PR with master, successfully (at the time) with no conflicts, and ran it through tier 1 to 3 testing, and there were no failures. ------------- Marked as reviewed by psandoz (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2612181239 From kvn at openjdk.org Wed Feb 12 16:28:32 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 16:28:32 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in the Leyden AOT cache. This avoids the need to patch the hidden VPTR pointer to the class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in the future. > > Fixed/cleaned SA code that processes CodeBlob and its subclasses. Use the `CodeBlob::_kind` field value to determine the type of blob. 
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Zero VM build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/1d108349..b09ddce6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=04-05 Stats: 11 lines in 2 files changed: 7 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From jbhateja at openjdk.org Wed Feb 12 17:08:25 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 17:08:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 14:46:49 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Review comments resolutions > > Looks good. I merged this PR with master, successfully (at the time) with no conflicts, and ran it through tier 1 to 3 testing and there were no failures. Thanks @PaulSandoz , @eme64 and @sviswa7 for your valuable feedback. 
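As the PR summary notes, the Float16 intrinsics receive unwrapped `short` arguments encoding IEEE 754 binary16 values. The conversion between that encoding and `float` has been available in the core JDK since JDK 20 via `Float.floatToFloat16`/`Float.float16ToFloat`; the following stand-alone sketch (illustrative only; it is not the intrinsified code path from the PR) shows a scalar binary16 max computed by widening to `float`:

```java
class Binary16Max {
    // Max of two IEEE 754 binary16 values stored in shorts, computed by
    // widening to float, comparing, and narrowing the result back.
    public static short maxFp16(short a, short b) {
        float fa = Float.float16ToFloat(a);
        float fb = Float.float16ToFloat(b);
        return Float.floatToFloat16(Math.max(fa, fb));
    }

    public static void main(String[] args) {
        short one = Float.floatToFloat16(1.0f);
        short two = Float.floatToFloat16(2.0f);
        System.out.println(Float.float16ToFloat(maxFp16(one, two))); // prints 2.0
    }
}
```

The intrinsified path described in the summary avoids exactly this kind of short-to-float round trip through general-purpose registers by keeping the value in a floating-point register between the S2HF/HF2S reinterpretation nodes.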
------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2654337191 From jbhateja at openjdk.org Wed Feb 12 17:08:28 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 17:08:28 GMT Subject: Integrated: 8342103: C2 compiler support for Float16 type and associated scalar operations In-Reply-To: References: Message-ID: <0jFE4E2Aewb7aCN5nZrmV3Lz3SSsNSmhhUEiL9JQjMA=.c202afcf-340c-4fca-8a2a-778c7677fe1f@github.com> On Sun, 15 Dec 2024 18:05:02 GMT, Jatin Bhateja wrote: > Hi All, > > This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) > > Following is the summary of changes included with this patch:- > > 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. > 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. > 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. > - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. > 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. > 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. > 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. > 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF > 9. X86 backend implementation for all supported intrinsics. > 10. 
Functional and Performance validation tests. > > Kindly review the patch and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: 4b463ee7 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/4b463ee70eceb94fdfbffa5c49dd58dcc6a6c890 Stats: 2855 lines in 56 files changed: 2788 ins; 0 del; 67 mod 8342103: C2 compiler support for Float16 type and associated scalar operations Co-authored-by: Paul Sandoz Co-authored-by: Bhavana Kilambi Co-authored-by: Joe Darcy Co-authored-by: Raffaello Giulietti Reviewed-by: psandoz, epeter, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/22754 From kvn at openjdk.org Wed Feb 12 20:21:13 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 20:21:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build It is ready for re-review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2654754643 From cjplummer at openjdk.org Thu Feb 13 02:36:16 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 02:36:16 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 118: > 116: } > 117: > 118: public static Class getClassFor(Address addr) { Did you consider using a lookup table here that is indexed using the kind value? src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 146: > 144: } > 145: } > 146: return null; Should this be an assert? src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 213: > 211: > 212: public boolean isUncommonTrapBlob() { > 213: if (!VM.getVM().isServerCompiler()) return false; Why is the check needed? Why not just return the value `getKind() == UncommonTrapKind` result below? 
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 95: > 93: } > 94: > 95: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address start) { I think the use of the name "start" here is a carryover from `findBlobUnsafe(Address start)`. I find it a very misleading name. cbAddr points to the "start" of the blob. "start" points somewhere in the middle of the blob. In fact, callers of this API sometimes pass in findStart(addr) for cbAddr, which just adds to the confusion. Perhaps this is a good time to rename "start" to something else, although I can't come up with a good suggestion, but I think anything other than "start" would be an improvement. Maybe "pcAddr". That aligns with the "for PC=" message below. Or maybe just "ptr" which aligns with `createCodeBlobWrapper(findStart(ptr), ptr);` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953665953 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953666268 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953667349 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953682557 From kvn at openjdk.org Thu Feb 13 03:43:17 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 03:43:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> Message-ID: <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> On Thu, 13 Feb 2025 02:06:57 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero VM build > > 
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 118: > >> 116: } >> 117: >> 118: public static Class getClassFor(Address addr) { > > Did you consider using a lookup table here that is indexed using the kind value? Example please. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 146: > >> 144: } >> 145: } >> 146: return null; > > Should this be an assert? I don't think we need it - the caller `CodeCache.createCodeBlobWrapper()` will throw `RuntimeException` when `null` is returned. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 213: > >> 211: >> 212: public boolean isUncommonTrapBlob() { >> 213: if (!VM.getVM().isServerCompiler()) return false; > > Why is the check needed? Why not just return the value `getKind() == UncommonTrapKind` result below? `UncommonTrapKind` and `ExceptionKind` are not initialized for Client VM because corresponding `CodeBlobKind` values are not defined. See `CodeBlob.initialize()`. Their not initialized value will be 0 which matches `CodeBlobKind::None` value. Returning true in such case will be incorrect. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 95: > >> 93: } >> 94: >> 95: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address start) { > > I think the use of the name "start" here is a carryover from `findBlobUnsafe(Address start)`. I find it a very misleading name. cbAddr points to the "start" of the blob. "start" points somewhere in the middle of the blob. In fact callers of this API somethimes pass in findStart(addr) for cbAddr, which just adds to the confusion. Perhaps this is a good time to rename "start" to something else, although I can't come up with a good suggestion, but I think anything other than "start" would be an improvement. Maybe "pcAddr". That aligns with the "for PC=" message below. 
Or maybe just "ptr" which aligns with `createCodeBlobWrapper(findStart(ptr), ptr);` `cbPc` with a comment explaining that it could be inside the code blob. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953732919 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953733212 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953738572 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953745389 From cjplummer at openjdk.org Thu Feb 13 05:22:14 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 05:22:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> Message-ID: On Thu, 13 Feb 2025 03:26:19 GMT, Vladimir Kozlov wrote: >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 118: >> >>> 116: } >>> 117: >>> 118: public static Class getClassFor(Address addr) { >> >> Did you consider using a lookup table here that is indexed using the kind value? > > Example please.

    static Class[] wrapperClasses = new Class[Number_Of_Kinds];
    wrapperClasses[NMethodKind]   = NMethodBlob.class;
    wrapperClasses[BufferKind]    = BufferBlob.class;
    ...;
    wrapperClasses[SafepointKind] = SafepointBlob.class;

    CodeBlob cb = new CodeBlob(addr);
    return wrapperClasses[cb.getKind()];

>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 146: >> >>> 144: } >>> 145: } >>> 146: return null; >> >> Should this be an assert? 
> > I don't think we need it - the caller `CodeCache.createCodeBlobWrapper()` will throw `RuntimeException` when `null` is returned. I guess my real question is whether or not it can be considered normal behavior to return null. It seems it should never happen, which is why I was suggesting an assert. >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 213: >> >>> 211: >>> 212: public boolean isUncommonTrapBlob() { >>> 213: if (!VM.getVM().isServerCompiler()) return false; >> >> Why is the check needed? Why not just return the value `getKind() == UncommonTrapKind` result below? > > `UncommonTrapKind` and `ExceptionKind` are not initialized for Client VM because corresponding `CodeBlobKind` values are not defined. See `CodeBlob.initialize()`. > Their not initialized value will be 0 which matches `CodeBlobKind::None` value. Returning true in such case will be incorrect. Ok. Leaving UncommonTrapKind and ExceptionKind uninitialized seems a bit error prone. Perhaps they can be given some sort of INVALID value. >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 95: >> >>> 93: } >>> 94: >>> 95: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address start) { >> >> I think the use of the name "start" here is a carryover from `findBlobUnsafe(Address start)`. I find it a very misleading name. cbAddr points to the "start" of the blob. "start" points somewhere in the middle of the blob. In fact callers of this API somethimes pass in findStart(addr) for cbAddr, which just adds to the confusion. Perhaps this is a good time to rename "start" to something else, although I can't come up with a good suggestion, but I think anything other than "start" would be an improvement. Maybe "pcAddr". That aligns with the "for PC=" message below. Or maybe just "ptr" which aligns with `createCodeBlobWrapper(findStart(ptr), ptr);` > > `cbPc` with comment explaining that it could be inside code blob. That sounds fine. 
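The lookup-table idea sketched in the review exchange above can be written out as a small, self-contained program. The kind constants and wrapper classes below are placeholders (the real SA types live in `sun.jvm.hotspot.code`); the point is only that a kind-indexed array replaces a chain of per-type checks:

```java
class WrapperLookup {
    // Placeholder kind constants; the real values come from the VM's CodeBlobKind.
    public static final int NMETHOD_KIND   = 0;
    public static final int BUFFER_KIND    = 1;
    public static final int SAFEPOINT_KIND = 2;
    public static final int NUMBER_OF_KINDS = 3;

    // Placeholder stand-ins for the SA's CodeBlob wrapper subtypes.
    public static class NMethod {}
    public static class BufferBlob {}
    public static class SafepointBlob {}

    // Kind-indexed table, filled once; one array access replaces an if/else chain.
    private static final Class<?>[] WRAPPER_CLASSES = new Class<?>[NUMBER_OF_KINDS];
    static {
        WRAPPER_CLASSES[NMETHOD_KIND]   = NMethod.class;
        WRAPPER_CLASSES[BUFFER_KIND]    = BufferBlob.class;
        WRAPPER_CLASSES[SAFEPOINT_KIND] = SafepointBlob.class;
    }

    public static Class<?> getClassFor(int kind) {
        // Failing loudly here mirrors the "should this be an assert?" question:
        // an unknown kind indicates a bug, not a normal condition.
        if (kind < 0 || kind >= NUMBER_OF_KINDS || WRAPPER_CLASSES[kind] == null) {
            throw new IllegalArgumentException("unknown code blob kind: " + kind);
        }
        return WRAPPER_CLASSES[kind];
    }

    public static void main(String[] args) {
        System.out.println(getClassFor(SAFEPOINT_KIND).getSimpleName()); // prints SafepointBlob
    }
}
```

A table like this also sidesteps the uninitialized-kind pitfall discussed above: kinds that are not defined for a given VM configuration simply stay `null` and are rejected.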
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953818292 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953819796 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953821968 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953822595 From jrose at openjdk.org Thu Feb 13 07:44:19 2025 From: jrose at openjdk.org (John R Rose) Date: Thu, 13 Feb 2025 07:44:19 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build I've read the code and it looks good. I find myself wishing for a few more comments to guide me, especially in knowing which methods to pay attention to, and which to ignore as "pure plumbing". The array of vptr-ptrs is the key element. It seems to work nicely. There are lots of regularizations here, which I enjoy. But the new code has (to me) distracting irregularities. Why define one Vptr as a struct and others as classes? Did we really regularize the names of all the print functions (they were irregular before)? I was glad to see lots of magic code deleted from SA. Although, having to look at SA at all is annoying! 
I noticed a lot of churn in "innocent bystander" client code that looks like this:

   p2i(_frame.pc()), decode_offset);
 - nm()->print_on(&ss);
 + nm()->print_on_v(&ss);
   nm()->method()->print_codes_on(&ss);

What is the client maintainer (or any casual reader) supposed to get from the "_v" suffix? I know we have made the "v/nv" distinction before, but it is rather obscure, not documented here. Is it described elsewhere in our code base? Our use of it here should be documented in codeBlob.hpp. Normally, we try to keep client APIs invariant while doing refactorings like this, so as to avoid touching all the client code. In this case, we have to use a new naming convention to distinguish all versions of (say) print_on: M. The implementation in each CB class K, which can be private if K::Vptr is a friend. P. The public API point, used outside of the CB classes, as well as inside. V. The name of the virtual function defined by each K::Vptr. I would expect P to have the "nice name" like print_on, not print_on_v, while the private method M would be print_on_impl or print_on_nv, and never called except from Vptr or other methods of the same name. But any convention will work, as long as it is documented and held to consistently. I'm sympathetic to both Andrew's call for macro-enforced regularity, and Vladimir's objection that macros make things hard to follow. If macros won't work for us here, let's define a documented pattern and stick to it closely, documenting our decisions as we go. 
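For readers following the naming discussion, the M/P/V layering can be sketched with a kind-indexed dispatch table. This is an illustrative Java rendering with placeholder names, not the actual implementation (which is C++ in `codeBlob.hpp`); in the real change the payoff is that instances carry a plain `_kind` tag instead of a compiler-generated vtable pointer that the AOT cache would have to patch:

```java
// V: the single virtual hook, implemented once per kind.
interface Vptr {
    void printOn(CodeBlob instance, StringBuilder out);
}

// One final class, no subtypes: behavior varies by the kind tag, not by vtable.
final class CodeBlob {
    static final int NMETHOD_KIND = 0;
    static final int BUFFER_KIND  = 1;

    // Kind-indexed table of stateless dispatchers; index must match the kind constants.
    private static final Vptr[] VTABLE = {
        (cb, out) -> cb.printOnNMethodImpl(out),  // entry for NMETHOD_KIND
        (cb, out) -> cb.printOnBufferImpl(out),   // entry for BUFFER_KIND
    };

    private final int kind;
    private final String name;

    CodeBlob(int kind, String name) { this.kind = kind; this.name = name; }

    // P: the public API point keeps the "nice name" and dispatches by kind.
    public void printOn(StringBuilder out) { VTABLE[kind].printOn(this, out); }

    // M: per-kind implementations, private to the class.
    private void printOnNMethodImpl(StringBuilder out) { out.append("nmethod ").append(name); }
    private void printOnBufferImpl(StringBuilder out)  { out.append("buffer blob ").append(name); }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        new CodeBlob(NMETHOD_KIND, "foo").printOn(sb);
        System.out.println(sb); // prints: nmethod foo
    }
}
```

Here `printOn` plays the role of P, the private `*Impl` methods play M, and `Vptr.printOn` plays V, matching the convention proposed above.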
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2655760868 From aboldtch at openjdk.org Thu Feb 13 08:32:21 2025 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Thu, 13 Feb 2025 08:32:21 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build Similar to what @rose00 noted, I think the `_v` and `_nv` suffixes are unfortunate in the public API. Maybe we could add a protected `x_impl` containing the implementation, then dispatch to the correct one based on _kind, using the Vptr abstraction. And have the normal print_on method use this. We could let our leaf types directly call the specific implementation, not that I think that our print functions require compile time devirtualisation. There are many solutions here with their pros and cons. src/hotspot/share/code/codeBlob.hpp line 140: > 138: instance->print_value_on_nv(st); > 139: } > 140: }; I wonder why the base class is not abstract. 
AFAICT `print_value_on` is unreachable and `print_on` is only used by `DeoptimizationBlob::Vptr`, which also seems like a behavioural change, as before this patch calling `print_on` on a `DeoptimizationBlob` object would dispatch to `SingletonBlob::print_on`, not `CodeBlob::print_on`. Suggestion:

    struct Vptr {
      virtual void print_on(const CodeBlob* instance, outputStream* st) const = 0;
      virtual void print_value_on(const CodeBlob* instance, outputStream* st) const = 0;
    };

src/hotspot/share/code/codeBlob.hpp line 339: > 337: void print_value_on(outputStream* st) const; > 338: > 339: class Vptr : public CodeBlob::Vptr { I wonder if these should share the same type hierarchy as their container class. This would also solve the issue I noted in my other comment about not calling the correct `print_on`. Suggestion: class Vptr : public RuntimeBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 427: > 425: void print_value_on(outputStream* st) const; > 426: > 427: class Vptr : public CodeBlob::Vptr { Suggestion: class Vptr : public RuntimeBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 467: > 465: void print_value_on(outputStream* st) const; > 466: > 467: class Vptr : public CodeBlob::Vptr { Suggestion: class Vptr : public RuntimeBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 553: > 551: void print_value_on(outputStream* st) const; > 552: > 553: class Vptr : public CodeBlob::Vptr { This one specifically Suggestion: class Vptr : public SingletonBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 679: > 677: void print_value_on(outputStream* st) const; > 678: > 679: class Vptr : public CodeBlob::Vptr { Suggestion: class Vptr : public RuntimeBlob::Vptr { ------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2614177723 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954019308 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954024528 PR Review Comment: 
https://git.openjdk.org/jdk/pull/23533#discussion_r1954028620 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954028940 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954027733 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954029504 From dnsimon at openjdk.org Thu Feb 13 10:04:20 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 10:04:20 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong Message-ID: The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. ------------- Commit messages: - converted JVMCIRuntime::_shared_library_javavm_id to jlong Changes: https://git.openjdk.org/jdk/pull/23610/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23610&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349977 Stats: 7 lines in 3 files changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23610.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23610/head:pull/23610 PR: https://git.openjdk.org/jdk/pull/23610 From epeter at openjdk.org Thu Feb 13 11:39:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 11:39:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Mon, 10 Feb 2025 09:26:32 GMT, Galder Zamarre?o wrote: >> @eastig is helping with the results on aarch64, so I will verify the numbers in same way done below for x86_64 once he provides me with the 
results. >> >> Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly). >> >> First I will go through the results of `MinMaxVector`. This benchmark computes throughput by default so the higher the number the better. >> >> # MinMaxVector AVX-512 >> >> Following are results with AVX-512 instructions: >> >> Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units >> MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 834.127 3688.961 ops/ms >> MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 1147.010 3687.721 ops/ms >> MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 1126.718 1072.812 ops/ms >> MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 1070.921 1070.538 ops/ms >> MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 510.483 1073.081 ops/ms >> MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 935.658 1016.910 ops/ms >> MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1007.410 933.774 ops/ms >> MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.582 1017.337 ops/ms >> MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 967.288 966.945 ops/ms >> MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 967.327 967.382 ops/ms >> MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 849.689 967.327 ops/ms >> MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 966.323 967.275 ops/ms >> MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 967.340 967.228 ops/ms >> MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 880.921 967.233 ops/ms >> >> >> ### `longReduction[Min|Max]` performance improves slightly when probability is 100 >> >> Without the patch the code uses compare instructions: >> >> >> 7.83% ???? ???? ? 0x00007f4f700fb305: imulq $0xb, 0x20(%r14, %r8, 8), %rdi >> ???? ???... > >> At 100% probability baseline fails to vectorize because it observes a control flow. 
This control flow is not the one you see in min/max implementations, but this is one added by HotSpot as a result of the JIT profiling. It observes that one branch is always taken so it optimizes for that, and adds a branch for the uncommon case where the branch is not taken. > > I've dug further into this to try to understand how the baseline hotspot code works, and the explanation above is not entirely correct. Let's look at the IR differences between say 100% vs 80% branch situations. > > At branch 80% you see: > > 1115 CountedLoop === 1115 598 463 [[ 1101 1115 1116 1118 451 594 ]] inner stride: 2 main of N1115 strip mined !orig=[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 692 LoadL === 1083 1101 393 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > 651 LoadL === 1095 1101 355 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > 747 MaxL === _ 651 692 [[ 451 ]] !orig=[608],[416] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 451 StoreL === 1115 1101 449 747 [[ 1116 454 911 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=9; !orig=1124 !jvms: MinMaxVector::longLoopMax @ bci:30 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ 
bci:19 (line 124) > > 594 CountedLoopEnd === 1115 593 [[ 1123 463 ]] [lt] P=0.999731, C=780799.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > > You see the counted loop with the LoadL for array loads and MaxL consuming those. The StoreL is for array assignment (I think). > > At branch 100% you see: > > > ... @galderz Thanks for all the explanations, that's really helpful ? **Discussion** - AVX512: only imprivements. - Expecially with probability 100, where before we used the bytecode, which would then create an `unstable_if` with uncommon trap. That meant we could not re-discover the CMove / Max later in the IR. Now that we never inline the bytecode, and just intrinsify directly, we can use `vpmax` and that is faster. - Ah, maybe that was all incorrect, though it sounded reasonable. You seem to suggest that we actually did use to inline both branches, but that the issue was that `PhaseIdealLoop::conditional_move` does not like extreme probabilities, and so it did not convert 100% cases to CMove, and so it did not use to vectorize. Right. Getting the probability cutoff just right it a little tricky there, and the precise number can seem strange. But that's a discussion for another day. - The reduction case is only improved slightly... at least. Maybe we can further improve the throughput with [this](https://bugs.openjdk.org/browse/JDK-8345245) later on. - AVX2: mixed results - `longReductionMax/Min`: vector max / min is not implemented. We should investigate why. - It seems like the `MaxVL` and `MinVL` (e.g. `vpmaxsq`) instructions are only implemented directly for AVX512, see [this](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=4669,2611&text=max_epi64). - As you suggested @galderz we could consider implementing it via `cmove` in the backend for `AVX2` and maybe lower. Maybe we can talk with @jatin-bhateja about this. 
That would probably already be worth it on its own, in a separate RFE. Because I would suspect it could give speedup in the non 100% cases as well. Maybe this would even have to be an RFE that makes it in first, so we don't have regressions here? - But even still: just intfinsifying should not get us a regression, because there will always be cases where the auto-vectorizer fails, and so the scalar code should not be slower with your patch than on master, right? So we need to investigate this scalar issue as well. - VectorReduction2.WithSuperword on AVX-512 - `long[Min|Max]Simple performance drops considerably`. Yes, this case is not yet supposed to vectorize, I'm working on that - it is the issue with "simple" reductions, i.e. those that do no work other than reduce. Our current reduction heuristic thinks these are not profitable to vectorize - but that is wrong in almost all cases. You even filed an issue for that a while back ;) see https://bugs.openjdk.org/browse/JDK-8345044 and related issues. We could bite the bullet on this, knowing that I'm working on it and it will probably fix that issue, or we just wait a little here. Let's discuss. - VectorReduction2.NoSuperword on AVX-512 machine - Hmm, ok. So we seem to realize that the scalar case is slower with your patch in some cases, because now we have a `cmove` on the critical path, and previously we could just predict the branches, which was faster. Interesting that the number of other instructions has an effect here as well, you seem to see a speedup with the "big" benchmarks, but the "small" and "dot" benchmarks are slower. This is surprising. It would be great if we understood why it behaves this way. **Summary** Wow, things are more complicated than I would have thought, I hope you are not too discouraged ? We seem to have these issues, maybe there are more: - AVX2 does not have long-vector-min/max implemented. That can be done in a separate RFE. 
- Simple reductions do not vectorize, a known issue, see https://bugs.openjdk.org/browse/JDK-8345044, I'm working on that. - Scalar reductions are slower with your patch for extreme probabilities. Before, they were done with branches, and branch prediction was fast. Now with cmove or max instructions, the critical path is longer, and that makes things slow. Maybe this could be alleviated by reordering / reassociating the reduction path, see [JDK-8345245](https://bugs.openjdk.org/browse/JDK-8345245). Alternatively, we could convert the `cmove` back to a branch, but for that we would probably need to know the branching probability, which we now do not have any more, right? Tricky. This seems to be the real issue we need to address and discuss. @galderz What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656328729 From epeter at openjdk.org Thu Feb 13 11:49:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 11:49:18 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Mon, 10 Feb 2025 09:26:32 GMT, Galder Zamarreño wrote: >> @eastig is helping with the results on aarch64, so I will verify the numbers in same way done below for x86_64 once he provides me with the results. >> >> Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly). >> >> First I will go through the results of `MinMaxVector`. This benchmark computes throughput by default so the higher the number the better.
>> >> # MinMaxVector AVX-512 >> >> Following are results with AVX-512 instructions: >> >> Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units >> MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 834.127 3688.961 ops/ms >> MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 1147.010 3687.721 ops/ms >> MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 1126.718 1072.812 ops/ms >> MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 1070.921 1070.538 ops/ms >> MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 510.483 1073.081 ops/ms >> MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 935.658 1016.910 ops/ms >> MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1007.410 933.774 ops/ms >> MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.582 1017.337 ops/ms >> MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 967.288 966.945 ops/ms >> MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 967.327 967.382 ops/ms >> MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 849.689 967.327 ops/ms >> MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 966.323 967.275 ops/ms >> MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 967.340 967.228 ops/ms >> MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 880.921 967.233 ops/ms >> >> >> ### `longReduction[Min|Max]` performance improves slightly when probability is 100 >> >> Without the patch the code uses compare instructions: >> >> >> 7.83% ???? ???? ? 0x00007f4f700fb305: imulq $0xb, 0x20(%r14, %r8, 8), %rdi >> ???? ???... > >> At 100% probability baseline fails to vectorize because it observes a control flow. This control flow is not the one you see in min/max implementations, but this is one added by HotSpot as a result of the JIT profiling. It observes that one branch is always taken so it optimizes for that, and adds a branch for the uncommon case where the branch is not taken. 
> > I've dug further into this to try to understand how the baseline hotspot code works, and the explanation above is not entirely correct. Let's look at the IR differences between say 100% vs 80% branch situations. > > At branch 80% you see: > > 1115 CountedLoop === 1115 598 463 [[ 1101 1115 1116 1118 451 594 ]] inner stride: 2 main of N1115 strip mined !orig=[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 692 LoadL === 1083 1101 393 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > 651 LoadL === 1095 1101 355 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > 747 MaxL === _ 651 692 [[ 451 ]] !orig=[608],[416] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 451 StoreL === 1115 1101 449 747 [[ 1116 454 911 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=9; !orig=1124 !jvms: MinMaxVector::longLoopMax @ bci:30 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 594 CountedLoopEnd === 1115 593 [[ 1123 463 ]] [lt] P=0.999731, C=780799.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > > You see the counted loop 
with the LoadL for array loads and MaxL consuming those. The StoreL is for array assignment (I think). > > At branch 100% you see: > > > ... @galderz How sure are we that intrinsifying directly is really the right approach? Maybe the approach via `PhaseIdealLoop::conditional_move` where we know the branching probability is a better one. Though of course knowing the branching probability is no perfect heuristic for how good branch prediction is going to be, but it is at least something. So I'm wondering if there could be a different approach that sees all the wins you get here, without any of the regressions? If we are just interested in better vectorization: the current issue is that the auto-vectorizer cannot handle CFG, i.e. we do not yet do if-conversion. But if we had if-conversion, then the inlined CFG of min/max would just be converted to vector CMove (or vector min/max where available) at that point. We can take the branching probabilities into account, just like `PhaseIdealLoop::conditional_move` does - if that is necessary. Of course if-conversion is far away, and we will encounter a lot of issues with branch prediction etc, so I'm scared we might never get there - but I want to try ;) Do we see any other wins with your patch, that are not due to vectorization, but just scalar code?
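The if-conversion limitation described here can be made concrete with a small sketch (illustrative code of mine, not taken from the patch or from the `MinMaxVector` benchmark):

```java
import java.util.Arrays;

public class MaxLoops {
    // Branchy form: the ternary parses to real control flow (If/Region
    // nodes) in the loop body, which SuperWord currently cannot
    // if-convert, so the loop stays scalar.
    static void maxBranchy(long[] a, long[] b, long[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] > b[i] ? a[i] : b[i];
        }
    }

    // Intrinsic form: Math.max(long, long) becomes a single MaxL node,
    // a straight-line body that can map to a vector max where the
    // hardware provides one.
    static void maxIntrinsic(long[] a, long[] b, long[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
    }

    public static void main(String[] args) {
        long[] a = {3, -7, 5}, b = {2, 9, 5};
        long[] c1 = new long[3], c2 = new long[3];
        maxBranchy(a, b, c1);
        maxIntrinsic(a, b, c2);
        System.out.println(Arrays.equals(c1, c2)); // prints "true"
    }
}
```

Both loops compute the same result; they differ only in the IR shape the JIT sees, which is the crux of the vectorization discussion.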
@galderz Maybe we can discuss this offline at some point as well :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656350896 PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656351785 From yzheng at openjdk.org Thu Feb 13 12:35:15 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Thu, 13 Feb 2025 12:35:15 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. LGTM ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/23610#pullrequestreview-2614845493 From dnsimon at openjdk.org Thu Feb 13 12:50:15 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 12:50:15 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. 
Passes the openjdk-pr-canary: https://github.com/dougxc/openjdk-pr-canary/blob/master/tested-prs/23610/b7a38951a54ff4c1186a3682f717805822575ea8.json ------------- PR Comment: https://git.openjdk.org/jdk/pull/23610#issuecomment-2656499655 From roland at openjdk.org Thu Feb 13 16:46:16 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 13 Feb 2025 16:46:16 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Thu, 13 Feb 2025 11:46:35 GMT, Emanuel Peter wrote: > Do we see any other wins with your patch, that are not due to vectorization, but just scalar code? I think there are some. The current transformation from the parsed version of min/max to a conditional move to a `Max`/`Min` node depends on the conditional move transformation which has its own set of heuristics and while it happens on simple test cases, that's not necessarily the case on all code shapes. I don't think we want to trust it too much. With the intrinsic, the type of the min or max can be narrowed down in a way it can't be when the code includes control flow or a conditional move. That in turn, once types have propagated, could cause some constant to appear and could be a significant win. The `Min`/`Max` nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise.
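The type-narrowing argument above can be illustrated with a sketch (an editorial example, not code from the PR; the constant-folding outcome is what the argument predicts, not something measured here):

```java
public class ClampDemo {
    // Clamp to [0, 255]: with intrinsified min/max these become two
    // floating MinL/MaxL nodes whose result type the compiler can
    // narrow to the range [0, 255]. A downstream expression such as
    // (clamp(v) >>> 8) could then fold to the constant 0. With inlined
    // branches, the control flow hides that range from type propagation.
    static long clamp(long v) {
        return Math.max(0L, Math.min(v, 255L));
    }

    public static void main(String[] args) {
        System.out.println(clamp(300)); // 255
        System.out.println(clamp(-5));  // 0
        System.out.println(clamp(42));  // 42
    }
}
```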
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2657176312 From kvn at openjdk.org Thu Feb 13 17:05:05 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:05:05 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> Message-ID: On Thu, 13 Feb 2025 05:14:59 GMT, Chris Plummer wrote: >> Example please. > > static Class[] wrapperClasses = new Class[Number_Of_Kinds]; > wrapperClasses[NMethodKind] = NMethodBlob.class; > wrapperClasses[BufferKind] = BufferBlob.class; > ...; > wrapperClasses[SafepointKind] = SafepointBlob.class; > > > > CodeBlob cb = new CodeBlob(addr); > return wrapperClasses[cb.getKind()]; Done. >> I don't think we need it - the caller `CodeCache.createCodeBlobWrapper()` will throw `RuntimeException` when `null` is returned. > > I guess my real question is whether or not it can be considered normal behavior to return null. It seems it should never happen, which is why I was suggesting an assert. With your suggested `wrapperClasses[]` we will get an OOB exception. No need for a separate assert. >> `UncommonTrapKind` and `ExceptionKind` are not initialized for Client VM because corresponding `CodeBlobKind` values are not defined. See `CodeBlob.initialize()`. >> Their uninitialized value will be 0, which matches `CodeBlobKind::None` value. Returning true in such a case will be incorrect. > > Ok. Leaving UncommonTrapKind and ExceptionKind uninitialized seems a bit error prone. Perhaps they can be given some sort of INVALID value. Done. Initialized them to `Number_Of_Kinds + 1`.
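The kind-indexed lookup converged on above might look roughly like this as standalone Java (an editorial paraphrase with invented names such as `wrapperFor`; the actual SA classes differ):

```java
public class BlobWrappers {
    // Mirrors CodeBlobKind ordinals on the VM side (illustrative subset).
    static final int NMETHOD_KIND = 0, BUFFER_KIND = 1, NUMBER_OF_KINDS = 2;

    static class CodeBlob {}
    static class NMethodBlob extends CodeBlob {}
    static class BufferBlob extends CodeBlob {}

    // One wrapper class per kind. An out-of-range kind fails fast with
    // ArrayIndexOutOfBoundsException, which is why no separate assert
    // is needed.
    static final Class<?>[] wrapperClasses = new Class<?>[NUMBER_OF_KINDS];
    static {
        wrapperClasses[NMETHOD_KIND] = NMethodBlob.class;
        wrapperClasses[BUFFER_KIND] = BufferBlob.class;
    }

    static Class<?> wrapperFor(int kind) {
        return wrapperClasses[kind]; // throws AIOOBE on unknown kinds
    }
}
```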
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954886028 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954890522 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954891616 From kvn at openjdk.org Thu Feb 13 17:05:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:05:04 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v7] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <5LGcbNB2_MigrbHGKV3CY8e6z-1iioFUuiSvTU8-lNY=.af273d17-6ab5-4b12-ae41-e6900494b5ee@github.com> > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: - Update SA based on comments - Merge branch 'master' into 8349088 - Fix Zero VM build - Fix Minimal and Zero VM builds once more - Fix Minimal and Zero VM builds again - Add CodeBlob proxy vtable - Fix Zero and Minimal VM builds - 8349088: De-virtualize Codeblob and nmethod ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/b09ddce6..515495b2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=05-06 Stats: 11482 lines in 618 files changed: 7914 ins; 1738 del; 1830 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Thu Feb 13 17:14:59 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:14:59 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: rename SA argument ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/515495b2..61fdee68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=06-07 Stats: 6 lines in 1 file changed: 2 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Thu Feb 13 17:14:59 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:14:59 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> Message-ID: On Thu, 13 Feb 2025 05:19:48 GMT, Chris Plummer wrote: >> `cbPc` with comment explaining that it could be inside code blob. > > That sounds fine. done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954906986 From cjplummer at openjdk.org Thu Feb 13 17:14:59 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 17:14:59 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 16:57:18 GMT, Vladimir Kozlov wrote: >>> What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? 
The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. >> >> I don't need any more mapping from CodeBlob class to corresponding virtual table name which does not exist anymore. `CodeBlob::_kind` field's value is used to determine which class should be used. >> >> I think `hashMap` is overkill here. I can construct array `Class cbClasses[]` and use `cbClasses[CodeBlob::_kind]` instead of `if/else` in `getClassFor`. But I would still need to check for unknown value of `CodeBlob::_kind` somehow. > >> impact on things like the "findpc" functionality > > Do you mean `findpc()` function in VM which is used in debugger? Nothing should be changed for it. > It calls `os::print_location()` which calls `CodeBlob::dump_for_addr(addr, st, verbose);`: > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/os.cpp#L1278 Actually I was referring to the clhsdb findpc command, which uses PointerFinder, but actually that should be ok because it special cases the codecache and knows how to find CodeBlobs in it. It's the clhsdb "inspect" command that will no longer be able to identify the type for an address that points to the start of a CodeBlob. This is true of any address that points to the start of a hotspot C++ object that does not have a vtable, or is not declared in vmstructs. So it's not a new issue, but is just adding more types to the list that "inspect" won't figure out. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954906641 From epeter at openjdk.org Thu Feb 13 17:16:22 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 17:16:22 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Thu, 13 Feb 2025 16:43:22 GMT, Roland Westrelin wrote: > The current transformation from the parsed version of min/max to a conditional move to a Max/Min node depends on the conditional move transformation which has its own set of heuristics and while it happens on simple test cases, that's not necessarily the case on all code shapes. I don't think we want to trust it too much. Well, actually people have tried to improve the conditional move transformation, and it is really really difficult. It's hard not to get regressions. I'm wondering how much easier it is for min / max. Maybe we have similar limitations, especially with predicting how well branch prediction performs. You are probably right about type propagation and `Min / Max` being floating nodes. @rwestrel What do you think about the regressions in the scalar cases of this patch?
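The branch-prediction-versus-critical-path trade-off behind these scalar regressions can be seen in a reduction shape like this (an illustrative sketch, not the actual `VectorReduction2` benchmark code):

```java
public class ReductionMax {
    // Branchy form: 'acc' only updates when a new maximum appears.
    // With extreme probabilities the branch is almost never taken and
    // the predictor makes the loop-carried update nearly free.
    static long maxBranchy(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            if (v > acc) acc = v;
        }
        return acc;
    }

    // Intrinsic/cmove form: every iteration executes a max on the
    // loop-carried dependency chain, lengthening the critical path
    // regardless of how predictable the data is.
    static long maxIntrinsic(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            acc = Math.max(acc, v);
        }
        return acc;
    }
}
```

Both methods return the same result; the performance difference comes purely from how the update is executed each iteration.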
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2657253439 From never at openjdk.org Thu Feb 13 17:22:13 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 13 Feb 2025 17:22:13 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23610#pullrequestreview-2615740718 From jrose at openjdk.org Thu Feb 13 17:25:13 2025 From: jrose at openjdk.org (John R Rose) Date: Thu, 13 Feb 2025 17:25:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument One related idea: The Vptr classes seem to be regular enough to be templated. 
That is, one class body, instantiated with a template argument for each code blob type (and probably another for the enum). That would remove some of the distracting per-class boilerplate. Something like: template class Vptr_Impl : public Vptr { override void print_on(const CodeBlob* instance, outputStream* st) const { assert(instance->kind() == Tkind, "sanity"); ((const CB_T*)instance)->print_on_impl(st); } ... override bool assert_sane(const CodeBlob* instance) { assert(instance->kind() == Tkind, ""); return true; } }; class CodeBlob { public: final Vptr* vptr() const { Vptr* vptr = vptr_array[_kind]; assert(vptr->assert_sane(this), "correct array element"); return vptr; } final void print_on(outputStream* st) const { vptr()->print_on(this, st); } }; Then: const Vptr* array[] = { &Vptr_Impl(), ... &Vptr_Impl(), ... }; The array could be filled by a macro that tracks the enum members; I like that as a small job for macros (no code in it). Then: class UncommonTrapBlob : public OtherBlob { protected: // impl "M" method is not public void print_on_impl(outputStream* st) const { OtherBlob::print_on_impl(st); st->print("my field = %d", _my_field); } // Vptr needs to call impl method friend class Vptr_Impl; // this might break down, so make it all public in the end }; I don't see any reason the Vptr subclasses need to be related in any more detail as subs or supers. Well, C++ is a bag of surprises, so there are probably several reasons the above sketch is wrong. But something like it might add a little more readability and predictability to the code.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657274388 From kvn at openjdk.org Thu Feb 13 17:29:14 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:29:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:22:18 GMT, John R Rose wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > One related idea: The Vptr classes seem to be regular enough to be templated. That is, one class body, instantiated with a template argument for each code blob type (and probably another for the enum). That would remove some of the distracting per-class boilerplate. Something like: > > > template > class Vptr_Impl : public Vptr { > override void print_on(const CodeBlob* instance, outputStream* st) const { > assert(instance->kind() == Tkind, "sanity"); > ((const CB_T*)instance)->print_on_impl(st); > } > ? > override bool assert_sane(cosnt CodeBlob* instance) { > assert(instance->kind() == Tkind, ""); > return true; > } > }; > > class CodeBlob { > public: > final Vptr* vptr() const { > Vptr* vptr = vptr_array[_kind]; > assert(vptr->assert_sant(this), "correct array element"); > return vptr; > } > final void print_on(outputStream* st) const { > vptr()->print_on(this, st); > } > }; > > > Then: > > > const Vptr* array[] = { > &Vptr_Impl(), > ... > &Vptr_Impl(), > ... > }; > > > The array could be filled by a macro that tracks the enum members; I like that as a small job for macros (no code in it). 
> > Then: > > > class UncommonTrapBlob : public OtherBlob { > protected: // impl "M" method is not public > void print_on_impl(outputStream* st) const { > OtherBlob::print_on_impl(st); > st->print("my field = %d", _my_field); > } > // Vptr needs to call impl method > friend class Vptr_Impl; // this might break down, so make it all public in the end > }; > > > I don't see any reason the Vptr subclasses need to be related in any more detail as subs or supers. > > Well, C++ is a bag of surprises, so there are probably several reasons the above sketch is wrong. But something like it might add a little more readability and predictability to the code. Thank you, @rose00 and @xmas92, for review and suggestions. Let me say it first - printing code for code blobs and nmethod is a big mess. It requires a separate big change to clean it up. For example, I have to go through CodeBlob's virtual dispatch `print_value_on_v()` for nmethod because some sets of `nmethod::print*()` are defined only in debug VM: [nmethod.hpp#L919](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/nmethod.hpp#L919) Then `nmethod` has more mess which requires C++ trickery because it does not follow the print API in CodeBlob: void print(outputStream* st) const; // need to re-define this from CodeBlob else the overload hides it void print_on(outputStream* st) const override { CodeBlob::print_on(st); } void print_on(outputStream* st, const char* msg) const; ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657282969 From kvn at openjdk.org Thu Feb 13 17:37:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:37:19 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring
in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument Saying that, I agree that I need to add comments explaining the printing API and how the Vptr class will work. I will work on @xmas92's suggestions and look into using `_impl`. I will try to look at the templates @rose00 suggested, but I don't want to complicate the code just for a few print methods. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657303967 From kvn at openjdk.org Thu Feb 13 18:04:17 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 18:04:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <3RrosS3Q-iEBqaD4hVGMfjY2hDGLqwWwSUqgT0Za1k4=.1e32f3f0-6677-4082-b100-ce9b4603ec80@github.com> On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob.
>> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument > AFAICT `print_value_on` is unreachable It is reachable in product VM when `print_value_on_v()` is called for `nmethod` which does not have `print_value_on()` in product VM. Which can be solved by adding simple `nmethod::print_value_on()` for product VM but it will change current behavior. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657354310 From cjplummer at openjdk.org Thu Feb 13 19:31:15 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 19:31:15 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 97: > 95: // cbAddr - address of a code blob > 96: // cbPC - address inside of a code blob > 97: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address cbPC) { Can you change findBlobUnsafe() above also? That's where the naming problem originated. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955098013 From dnsimon at openjdk.org Thu Feb 13 19:38:19 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 19:38:19 GMT Subject: Integrated: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: <7BPX92kK6cDWVILYcvyQXfSssFDFjv0XjIZQlGnlRhI=.6521b1d0-9c70-4054-a276-601536946443@github.com> On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. This pull request has now been integrated. Changeset: a88e2a58 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/a88e2a58bf834081db55c2071d072567ea763354 Stats: 7 lines in 3 files changed: 0 ins; 0 del; 7 mod 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong Reviewed-by: yzheng, never ------------- PR: https://git.openjdk.org/jdk/pull/23610 From dnsimon at openjdk.org Thu Feb 13 19:38:19 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 19:38:19 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. Thanks for the reviews. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23610#issuecomment-2657542024 From dlong at openjdk.org Thu Feb 13 22:50:17 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 13 Feb 2025 22:50:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/c1/Runtime1.java line 65: > 63: public CodeBlob blobFor(int id) { > 64: Address blobAddr = blobsField.getStaticFieldAddress().getAddressAt(id * VM.getVM().getAddressSize()); > 65: return VM.getVM().getCodeCache().createCodeBlobWrapper(blobAddr); We don't need to change all the callers if we keep a 1-arg version of createCodeBlobWrapper(): public CodeBlob createCodeBlobWrapper(Address codeBlobAddr) { return createCodeBlobWrapper(codeBlobAddr, codeBlobAddr); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955316582 From dlong at openjdk.org Thu Feb 13 23:04:17 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 13 Feb 2025 23:04:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> 
Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/hotspot/share/compiler/oopMap.cpp line 567: > 565: fr->print_on(tty); > 566: tty->print(" "); > 567: cb->print_value_on(tty); tty->cr(); We could minimize the number of files changed if we keep print_value_on() for compatibility: void print_value_on(outputStream* st) const { print_value_on_v(st); } src/hotspot/share/runtime/vframe.inline.hpp line 178: > 176: INTPTR_FORMAT " not found or invalid at %d", > 177: p2i(_frame.pc()), decode_offset); > 178: nm()->print_on_v(&ss); I suggest removing _v suffix to reduce changes and match existing naming. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955325657 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955327438 From dlong at openjdk.org Fri Feb 14 00:11:16 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 14 Feb 2025 00:11:16 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. 
>> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/hotspot/share/code/codeBlob.hpp line 669: > 667: > 668: jobject receiver() { return _receiver; } > 669: ByteSize frame_data_offset() { return _frame_data_offset; } `frame_data_offset()` seems to be unused. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955373697 From dlong at openjdk.org Fri Feb 14 00:17:14 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 14 Feb 2025 00:17:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument HotSpot C++ changes look good. I skipped SA changes. ------------- Marked as reviewed by dlong (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2616477660 From roland at openjdk.org Fri Feb 14 16:55:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 14 Feb 2025 16:55:14 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Thu, 13 Feb 2025 16:43:22 GMT, Roland Westrelin wrote: >> @galderz How sure are that intrinsifying directly is really the right approach? >> >> Maybe the approach via `PhaseIdealLoop::conditional_move` where we know the branching probability is a better one. Though of course knowing the branching probability is no perfect heuristic for how good branch prediction is going to be, but it is at least something. >> >> So I'm wondering if there could be a different approach that sees all the wins you get here, without any of the regressions? >> >> If we are just interested in better vectorization: the current issue is that the auto-vectorizer cannot handle CFG, i.e. we do not yet do if-conversion. But if we had if-conversion, then the inlined CFG of min/max would just be converted to vector CMove (or vector min/max where available) at that point. We can take the branching probabilities into account, just like `PhaseIdealLoop::conditional_move` does - if that is necessary. Of course if-conversion is far away, and we will encounter a lot of issues with branch prediction etc, so I'm scared we might never get there - but I want to try ;) >> >> Do we see any other wins with your patch, that are not due to vectorization, but just scalar code? > >> Do we see any other wins with your patch, that are not due to vectorization, but just scalar code? > > I think there are some. 
> > The current transformation from the parsed version of min/max to a conditional move to a `Max`/`Min` node depends on the conditional move transformation which has its own set of heuristics and while it happens on simple test cases, that's not necessarily the case on all code shapes. I don't think we want to trust it too much. > > With the intrinsic, the type of the min or max can be narrowed down in a way it can't be whether the code includes control flow or a conditional move. That in turn, once types have propagated, could cause some constant to appear and could be a significant win. > > The `Min`/`Max` nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise. > @rwestrel What do you think about the regressions in the scalar cases of this patch? Shouldn't int `min`/`max` be affected the same way? I suppose extracting the branch probability from the `MethodData` and attaching it to the `Min`/`Max` nodes is not impossible. I did something like that in the `ScopedValue` PR that you reviewed (and was put on hold). Now, that would be quite a bit of extra complexity for what feels like a corner case. Another possibility would be to implement `CMove` with branches (https://bugs.openjdk.org/browse/JDK-8340206) or to move the implementation of `MinL`/`MaxL` into the ad files and experiment with branches there. It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues.
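The type-narrowing point above can be pictured at the Java source level (a hedged sketch with made-up names; the real narrowing happens on the value ranges C2 attaches to `MinL`/`MaxL` node inputs, not in source code). Once the compiler knows `x & 0xFF` lies in `[0, 255]`, a min with the constant `300` is provably the masked value and can fold away:

```java
public class MinNarrowingSketch {
    // If t is known to lie in [0, 255], Math.min(t, 300) is provably t,
    // so a Min node with narrowed input types can fold away entirely.
    static long clampedLow(long x) {
        long t = x & 0xFF;        // value range narrows to [0, 255]
        return Math.min(t, 300L); // always t once the range propagates
    }

    public static void main(String[] args) {
        for (long x = -1024; x <= 1024; x++) {
            if (clampedLow(x) != (x & 0xFF)) {
                throw new AssertionError("min did not fold to masked value");
            }
        }
        System.out.println("ok");
    }
}
```

With the branchy form, this kind of folding depends on the conditional-move heuristics firing first, which is the fragility described above.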
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2659821025 From kvn at openjdk.org Fri Feb 14 23:14:18 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:18 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument I addressed most of @xmas92's and @dean-long's comments and am working on avoiding the `_v` suffix. Thank you, Dean, for the review.
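The overall shape of the de-virtualization can be sketched as follows (a simplified Java illustration with made-up names, not the actual HotSpot or SA code). Instead of C++ virtual dispatch through a hidden vtable pointer, which would have to be patched when a blob is restored from the AOT cache, each blob carries a kind tag and dispatch switches on it, similar to how the SA uses the `CodeBlob::_kind` field to determine the blob type:

```java
public class KindDispatchSketch {
    enum Kind { NMETHOD, RUNTIME_STUB, DEOPTIMIZATION }

    // A blob is plain data plus a kind tag - with no virtual methods,
    // its memory image contains no vtable pointer that needs patching.
    record Blob(Kind kind, String name) {
        String printValue() {
            // Manual dispatch on the kind tag replaces virtual dispatch.
            return switch (kind) {
                case NMETHOD        -> "nmethod: " + name;
                case RUNTIME_STUB   -> "stub: " + name;
                case DEOPTIMIZATION -> "deopt blob";
            };
        }
    }

    public static void main(String[] args) {
        Blob nm = new Blob(Kind.NMETHOD, "Foo::bar");
        System.out.println(nm.printValue()); // prints "nmethod: Foo::bar"
    }
}
```

The static asserts mentioned in the PR description are the C++ guard that keeps the classes in this tag-dispatched form, so no virtual table can silently reappear.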
------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2618707275 PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2660443983 From kvn at openjdk.org Fri Feb 14 23:14:20 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:20 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 08:15:16 GMT, Axel Boldt-Christmas wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero VM build > > src/hotspot/share/code/codeBlob.hpp line 140: > >> 138: instance->print_value_on_nv(st); >> 139: } >> 140: }; > > I wonder why the base class is not abstract. AFAICT `print_value_on` is unreachable and `print_on` is only used by `DeoptimizationBlob::Vptr`, which also seems like a behavioural change, as before this patch calling `print_on` on a `DeoptimizationBlob` object would dispatch to `SingletonBlob::print_on`, not `CodeBlob::print_on`. > > Suggestion: > > struct Vptr { > virtual void print_on(const CodeBlob* instance, outputStream* st) const = 0; > virtual void print_value_on(const CodeBlob* instance, outputStream* st) const = 0; > }; done > src/hotspot/share/code/codeBlob.hpp line 339: > >> 337: void print_value_on(outputStream* st) const; >> 338: >> 339: class Vptr : public CodeBlob::Vptr { > > I wonder if these should share the same type hierarchy as their container class. This would also solve the issue I noted in my other comment about not calling the correct `print_on`.
> Suggestion: > > class Vptr : public RuntimeBlob::Vptr { Fixed > src/hotspot/share/code/codeBlob.hpp line 427: > >> 425: void print_value_on(outputStream* st) const; >> 426: >> 427: class Vptr : public CodeBlob::Vptr { > > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { Fixed > src/hotspot/share/code/codeBlob.hpp line 467: > >> 465: void print_value_on(outputStream* st) const; >> 466: >> 467: class Vptr : public CodeBlob::Vptr { > > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { Fixed > src/hotspot/share/code/codeBlob.hpp line 553: > >> 551: void print_value_on(outputStream* st) const; >> 552: >> 553: class Vptr : public CodeBlob::Vptr { > > This one specifically > Suggestion: > > class Vptr : public SingletonBlob::Vptr { fixed > src/hotspot/share/code/codeBlob.hpp line 679: > >> 677: void print_value_on(outputStream* st) const; >> 678: >> 679: class Vptr : public CodeBlob::Vptr { > > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956799673 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956801833 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956801994 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956802109 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956803039 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956827486 From kvn at openjdk.org Fri Feb 14 23:14:22 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:22 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <_9qiqpCFRxCMY4nADw0lqrNuOZYIKUpeY_7FYyoQWC8=.78588553-bede-45b1-bf2d-5ad306b81e29@github.com> On Fri, 14 Feb 2025 00:08:35 GMT, Dean Long wrote: >> Vladimir Kozlov has updated the pull request 
incrementally with one additional commit since the last revision: >> >> rename SA argument > > src/hotspot/share/code/codeBlob.hpp line 669: > >> 667: >> 668: jobject receiver() { return _receiver; } >> 669: ByteSize frame_data_offset() { return _frame_data_offset; } > > `frame_data_offset()` seems to be unused. removed > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/c1/Runtime1.java line 65: > >> 63: public CodeBlob blobFor(int id) { > >> 64: Address blobAddr = blobsField.getStaticFieldAddress().getAddressAt(id * VM.getVM().getAddressSize()); >> 65: return VM.getVM().getCodeCache().createCodeBlobWrapper(blobAddr); > > We don't need to change all the callers if we keep a 1-arg version of createCodeBlobWrapper(): > > public CodeBlob createCodeBlobWrapper(Address codeBlobAddr) { > return createCodeBlobWrapper(codeBlobAddr, codeBlobAddr); > } This is the only place where the arguments are the same. In the other two, the arguments are different. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956672379 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956667806 From kvn at openjdk.org Fri Feb 14 23:14:23 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:23 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <07aI9gwcVtc89Bte9DRQ6VwmCfhcBJJQlrXhxkRRgX0=.97d4a1cc-92a2-43dc-8516-2433eca67263@github.com> On Thu, 13 Feb 2025 19:27:19 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 97: > >> 95: // cbAddr - address of a code blob >> 96: // cbPC - address inside of a code blob >> 97: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address cbPC) { > > Can
you change findBlobUnsafe() above also? That's where the naming problem originated. After some thoughts I think `PC` is not usually used by us. I renamed `cbAddr` to `cbStart` and `cbPC`/`start` to `addr` in this whole file. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956664966 From kvn at openjdk.org Sat Feb 15 02:08:20 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 02:08:20 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 23:01:24 GMT, Dean Long wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > src/hotspot/share/runtime/vframe.inline.hpp line 178: > >> 176: INTPTR_FORMAT " not found or invalid at %d", >> 177: p2i(_frame.pc()), decode_offset); >> 178: nm()->print_on_v(&ss); > > I suggest removing _v suffix to reduce changes and match existing naming. Done. Testing now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956985708 From kvn at openjdk.org Sat Feb 15 06:13:57 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 06:13:57 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v9] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. 
Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Address comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/61fdee68..89a383e5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=07-08 Stats: 115 lines in 12 files changed: 7 ins; 7 del; 101 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Sat Feb 15 06:29:14 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 06:29:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v9] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <2aYXBHyZE83suQFtY_POyft2gbRwwF_Xf_qajA62Pgw=.1fe1143c-33c5-4e78-b691-3f85f176c598@github.com> On Sat, 15 Feb 2025 06:13:57 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address comments I removed `_v` from `CodeBlob::print*_on(st)` methods to reduce scope of VM changes. 
But I have to add an `_impl` suffix to these methods in CodeBlob subclasses. I renamed `nmethod::print_on(st, msg)` to `print_on_with_msg(st, msg)` to avoid a naming conflict C++ complains about. It caused a change in `dependencyContext.cpp`. I made the `CodeBlob::Vptr` class abstract as suggested. I added an empty `Vptr` class to `RuntimeBlob` because it is referenced in subclasses, and corrected the extensions in subclasses to avoid the mistakes @xmas92 pointed out. I also did some argument renaming in the SA's `CodeCache.java` as requested. Tier1-5 testing passed. Ready for a new round of reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2660770028 From kvn at openjdk.org Sat Feb 15 06:34:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 06:34:56 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob.
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Remove commented lines left by mistake ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/89a383e5..3fdf1c81 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=08-09 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From aboldtch at openjdk.org Mon Feb 17 06:41:18 2025 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Mon, 17 Feb 2025 06:41:18 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sat, 15 Feb 2025 06:34:56 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove commented lines left by mistake Not looked at the SA changes. lgtm. src/hotspot/share/code/codeBlob.hpp line 308: > 306: > 307: class Vptr : public CodeBlob::Vptr { > 308: }; Was this needed for some compiler? Or is it to be more explicit about the type hierarchy? 
------------- Marked as reviewed by aboldtch (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2620128040 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1957678232 From epeter at openjdk.org Mon Feb 17 08:40:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Feb 2025 08:40:17 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Fri, 14 Feb 2025 16:52:17 GMT, Roland Westrelin wrote: > I suppose extracting the branch probability from the MethodData and attaching it to the Min/Max nodes is not impossible. That is basically what `PhaseIdealLoop::conditional_move` already does, right? It detects the diamond and converts it to `CMove`. We could special case for `min / max`, and then we'd have the probability for the branch, which we could store at the node. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2662409450 From roland at openjdk.org Mon Feb 17 08:47:22 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 17 Feb 2025 08:47:22 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Mon, 17 Feb 2025 08:37:56 GMT, Emanuel Peter wrote: > > I suppose extracting the branch probability from the MethodData and attaching it to the Min/Max nodes is not impossible. > > That is basically what `PhaseIdealLoop::conditional_move` already does, right? It detects the diamond and converts it to `CMove`. 
We could special case for `min / max`, and then we'd have the probability for the branch, which we could store at the node. Possibly. We could also create the intrinsic they way it's done in the patch and extract the frequency from the `MethoData` for the min or max methods. The shape of the bytecodes for these methods should be simple enough that it should be feasible. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2662424292 From epeter at openjdk.org Mon Feb 17 10:39:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Feb 2025 10:39:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> On Mon, 17 Feb 2025 08:44:46 GMT, Roland Westrelin wrote: >>> I suppose extracting the branch probability from the MethodData and attaching it to the Min/Max nodes is not impossible. >> >> That is basically what `PhaseIdealLoop::conditional_move` already does, right? It detects the diamond and converts it to `CMove`. We could special case for `min / max`, and then we'd have the probability for the branch, which we could store at the node. > >> > I suppose extracting the branch probability from the MethodData and attaching it to the Min/Max nodes is not impossible. >> >> That is basically what `PhaseIdealLoop::conditional_move` already does, right? It detects the diamond and converts it to `CMove`. We could special case for `min / max`, and then we'd have the probability for the branch, which we could store at the node. > > Possibly. We could also create the intrinsic they way it's done in the patch and extract the frequency from the `MethoData` for the min or max methods. 
The shape of the bytecodes for these methods should be simple enough that it should be feasible. @rwestrel @galderz > It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues. I'm a little scared to just accept the regressions, especially for this "most average looking case": Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `(sum_i 1/i) / n`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in. > The Min/Max nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise. I suppose we could write an optimization that can hoist loop-independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry. > Shouldn't int min/max be affected the same way? I think we should be able to see the same issue here, actually. Yes.
Here's a quick benchmark:

java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java
CompileCommand: compileonly TestIntMax.test* bool compileonly = true
CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true
Warmup
5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes)
5226 94 3 TestIntMax::test1 (27 bytes)
5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes)
5238 96 4 TestIntMax::test1 (27 bytes)
Run
Time: 542056319
Warmup
6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes)
6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes)
6329 103 4 TestIntMax::test2 (34 bytes)
Run
Time: 166815209

That's a 4x regression on random input data! With:

import java.util.Random;

public class TestIntMax {
    private static Random RANDOM = new Random();

    public static void main(String[] args) {
        int[] a = new int[64 * 1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = RANDOM.nextInt();
        }
        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++) { test1(a); }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++) { test1(a); }
            long t1 = System.nanoTime();
            System.out.println("Time: " + (t1 - t0));
        }
        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++) { test2(a); }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++) { test2(a); }
            long t1 = System.nanoTime();
            System.out.println("Time: " + (t1 - t0));
        }
    }

    public static int test1(int[] a) {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            x = Math.max(x, a[i]);
        }
        return x;
    }

    public static int test2(int[] a) {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            x = (x >= a[i]) ? x : a[i];
        }
        return x;
    }
}

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2662706564 From roland at openjdk.org Mon Feb 17 10:50:22 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 17 Feb 2025 10:50:22 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> Message-ID: On Mon, 17 Feb 2025 10:36:52 GMT, Emanuel Peter wrote: > I suppose we could write an optimization that can hoist loop-independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry. Right. But it would likely not optimize as well. The new optimization would possibly have heuristics to limit complexity, so it could be limited. The diamond could be transformed into something else by some other optimization before it gets a chance to be hoisted. There are likely other optimizations that apply to floating nodes and would still not apply to branches: for instance, `MinL`/`MaxL` can be split thru phi even if the `min` call is not right after the merge point. With branches that's not true. Also, with more complexity comes more bugs.
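The hoisting argument can be pictured at the source level (again a hedged sketch with made-up names; C2 performs this on the floating `MaxL` node itself rather than by rewriting source). When both inputs of a max are loop invariant, a floating node can be evaluated once before the loop, whereas the branchy form offers no such guarantee:

```java
public class HoistMaxSketch {
    // bound is loop invariant, so a floating Max node computing it can
    // be hoisted out of the loop and evaluated once.
    static long clampSum(long[] a, long lo, long hi) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            long bound = Math.max(lo, hi); // invariant: hoistable
            sum += Math.min(a[i], bound);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] a = { 1, 5, 10 };
        // bound = max(3, 7) = 7, so the terms are min(1,7)=1,
        // min(5,7)=5, min(10,7)=7 and the sum is 13.
        System.out.println(clampSum(a, 3, 7)); // prints 13
    }
}
```

Written with an if-diamond instead of `Math.max`, the same hoist would depend on a separate diamond-hoisting optimization of the kind discussed above actually firing.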
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2662733218 From duke at openjdk.org Mon Feb 17 14:10:47 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 17 Feb 2025 14:10:47 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM Message-ID: By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. ------------- Commit messages: - removing trailing spaces - kyber aarch64 intrinsics Changes: https://git.openjdk.org/jdk/pull/23663/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349721 Stats: 2885 lines in 20 files changed: 2774 ins; 84 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From roland at openjdk.org Mon Feb 17 14:19:12 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 17 Feb 2025 14:19:12 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. 
But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... What are the architectures affected by this? Isn't it the case that x86 and aarch64 are unaffected by this? 
Is the motivation to use this as a way to do prep work for alias analysis?

Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is, when you take care of aliasing, are you going to use the same reason for aliasing and alignment checks)

I went over the code and it looks reasonable to me. I intend to do a more careful review later.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2663262133

From roland at openjdk.org  Mon Feb 17 15:05:17 2025
From: roland at openjdk.org (Roland Westrelin)
Date: Mon, 17 Feb 2025 15:05:17 GMT
Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]
In-Reply-To: <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com>
References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com>
 <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com>
 <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com>
Message-ID: <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com>

On Mon, 17 Feb 2025 10:36:52 GMT, Emanuel Peter wrote:

> I think we should be able to see the same issue here, actually. Yes.
> Here a quick benchmark below:

I observe the same:

Warmup
751 3 b TestIntMax::test1 (27 bytes)
Run
Time: 360 550 158
Warmup
1862 15 b TestIntMax::test2 (34 bytes)
Run
Time: 92 116 170

But then with this:

diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad
index 8cc4a970bfd..9abda8f4178 100644
--- a/src/hotspot/cpu/x86/x86_64.ad
+++ b/src/hotspot/cpu/x86/x86_64.ad
@@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr)
 %}
 
-instruct maxI_rReg(rRegI dst, rRegI src)
+instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
 %{
   match(Set dst (MaxI dst src));
+  effect(KILL cr);
 
   ins_cost(200);
-  expand %{
-    rFlagsReg cr;
-    compI_rReg(cr, dst, src);
-    cmovI_reg_l(dst, src, cr);
+  ins_encode %{
+    Label done;
+    __ cmpl($src$$Register, $dst$$Register);
+    __ jccb(Assembler::less, done);
+    __ mov($dst$$Register, $src$$Register);
+    __ bind(done);
   %}
+  ins_pipe(pipe_cmov_reg);
 %}

the performance gap narrows:

Warmup
770 3 b TestIntMax::test1 (27 bytes)
Run
Time: 94 951 677
Warmup
1312 15 b TestIntMax::test2 (34 bytes)
Run
Time: 70 053 824

(the number for test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663379660

From epeter at openjdk.org  Mon Feb 17 15:28:13 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 17 Feb 2025 15:28:13 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: 
References: 
Message-ID: 

On Mon, 17 Feb 2025 14:16:59 GMT, Roland Westrelin wrote:

>> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>>
>> **Background**
>>
>> With `-XX:+AlignVector`, all vector loads/stores must be aligned.
We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). 
>> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > What are the architectures affected by this? Isn't it the case that x86 and aarch64 are unaffected by this? Is the motivation to use this as a way to do prep work for alias analysis? > > Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to to use the same reason for aliasing and alignment checks) > > I went over the code and it looks reasonable to me. I intend to do a more careful review later. @rwestrel Thanks for having a first look! > What are the architectures affected by this? Isn't it the case that x86 and aarch64 are unaffected by this? Yes, x86 and aarch64 are unaffected, as far as I know. Well, we can simulate strict alignment with `-XX:+AlignVector`, and there it should behave correctly, and it currently fails with the `-XX:+VerifyAlignVector`. It would be nice if that was not the case, so that we can write tests with arbitrary alignment, and turn on those flags freely. > Is the motivation to use this as a way to do prep work for alias analysis? I see this as a bug-fix AND preparation for future work. I suppose I might not have fixed this bug here since our platforms are not really affected, but I might as well fix it now since I can re-use most of the code later. > Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to to use the same reason for aliasing and alignment checks) I suppose that is currently what I'm planning. But we could in principle separate them. But I would leave that for later, if there is any desire to do that. 
For now, I think it's ok to just go with a single "auto-vectorization" reason. Does that sound reasonable?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2663434802

From dnsimon at openjdk.org  Mon Feb 17 16:12:41 2025
From: dnsimon at openjdk.org (Doug Simon)
Date: Mon, 17 Feb 2025 16:12:41 GMT
Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2]
In-Reply-To: 
References: 
Message-ID: 

> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal.
>
> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI is still experimental and only has qualified exports to Graal, I don't think this needs a CSR.

Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 - remove non-native-image build time use of ServiceLoader - make Cleaner.clean public ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22869/files - new: https://git.openjdk.org/jdk/pull/22869/files/24bb39be..7c91d00c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22869&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22869&range=00-01 Stats: 212534 lines in 5089 files changed: 102007 ins; 88290 del; 22237 mod Patch: https://git.openjdk.org/jdk/pull/22869.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22869/head:pull/22869 PR: https://git.openjdk.org/jdk/pull/22869 From galder at openjdk.org Mon Feb 17 16:49:15 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 17 Feb 2025 16:49:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FF eVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: <_SUoth7bTq41M5TpGjQ5ADL2TOesK2tIIxmL21BZ6RU=.65284948-b4a8-4d01-a924-e9dfeefe1c88@github.com> On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin wrote: >> @rwestrel @galderz >> >>> It seems overall, we likely win more than we loose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues. 
>>
>> I'm a little scared to just accept the regressions, especially for this "most average looking case":
>> Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `(sum_i 1/i) / n`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in.
>>
>>> The Min/Max nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise.
>>
>> I suppose we could write an optimization that can hoist out loop independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry.
>>
>>> Shouldn't int min/max be affected the same way?
>>
>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below:
>>
>> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java
>> CompileCommand: compileonly TestIntMax.test* bool compileonly = true
>> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true
>> Warmup
>> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes)
>> 5226 94 3 TestIntMax::test1 (27 bytes)
>> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes)
>> 5238 96 4 TestIntMax::test1 (27 bytes)
>> Run
>> Time: 542056319
>> Warmup
>> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes)
>> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes)
>> 6329 103 4 TestIntMax::test2 (34 bytes)
>> Run
>> Time: 166815209
>>
>> That's a 4x regression on random input data!
>> >> With: >> >> import java.util.Random; >> >> public class TestIntMax { >> private static Random RANDOM = new Random(); >> >> public static void main(String[] args) { >> int[] a = new int[64 * 1024]; >> for (int i = 0; i < a.length; i++) { >>... > >> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below: > > I observe the same: > > > Warmup > 751 3 b TestIntMax::test1 (27 bytes) > Run > Time: 360 550 158 > Warmup > 1862 15 b TestIntMax::test2 (34 bytes) > Run > Time: 92 116 170 > > > But then with this: > > > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index 8cc4a970bfd..9abda8f4178 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr) > %} > > > -instruct maxI_rReg(rRegI dst, rRegI src) > +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) > %{ > match(Set dst (MaxI dst src)); > + effect(KILL cr); > > ins_cost(200); > - expand %{ > - rFlagsReg cr; > - compI_rReg(cr, dst, src); > - cmovI_reg_l(dst, src, cr); > + ins_encode %{ > + Label done; > + __ cmpl($src$$Register, $dst$$Register); > + __ jccb(Assembler::less, done); > + __ mov($dst$$Register, $src$$Register); > + __ bind(done); > %} > + ins_pipe(pipe_cmov_reg); > %} > > // ============================================================================ > > > the performance gap narrows: > > > Warmup > 770 3 b TestIntMax::test1 (27 bytes) > Run > Time: 94 951 677 > Warmup > 1312 15 b TestIntMax::test2 (34 bytes) > Run > Time: 70 053 824 > > > (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then? 
@rwestrel @eme64 I think that the data distribution in the `TestIntMax` above matters (see my explanations in https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364), so I've enhanced the test to control data distribution in the int[] (see at the bottom). Here are the results I see on my AVX-512 machine: Probability: 50% Warmup 7834 92 % b 3 TestIntMax::test1 @ 5 (27 bytes) 7836 93 b 3 TestIntMax::test1 (27 bytes) 7838 94 % b 4 TestIntMax::test1 @ 5 (27 bytes) 7851 95 b 4 TestIntMax::test1 (27 bytes) Run Time: 699 923 014 Warmup 9272 96 % b 3 TestIntMax::test2 @ 5 (34 bytes) 9274 97 b 3 TestIntMax::test2 (34 bytes) 9275 98 % b 4 TestIntMax::test2 @ 5 (34 bytes) 9287 99 b 4 TestIntMax::test2 (34 bytes) Run Time: 699 815 792 Probability: 80% Warmup 7872 92 % b 3 TestIntMax::test1 @ 5 (27 bytes) 7874 93 b 3 TestIntMax::test1 (27 bytes) 7875 94 % b 4 TestIntMax::test1 @ 5 (27 bytes) 7889 95 b 4 TestIntMax::test1 (27 bytes) Run Time: 699 947 633 Warmup 9310 96 % b 3 TestIntMax::test2 @ 5 (34 bytes) 9311 97 b 3 TestIntMax::test2 (34 bytes) 9312 98 % b 4 TestIntMax::test2 @ 5 (34 bytes) 9325 99 b 4 TestIntMax::test2 (34 bytes) Run Time: 699 827 882 Probability: 100% Warmup 7884 92 % b 3 TestIntMax::test1 @ 5 (27 bytes) 7886 93 b 3 TestIntMax::test1 (27 bytes) 7888 94 % b 4 TestIntMax::test1 @ 5 (27 bytes) 7901 95 b 4 TestIntMax::test1 (27 bytes) Run Time: 699 931 243 Warmup 9322 96 % b 3 TestIntMax::test2 @ 5 (34 bytes) 9323 97 b 3 TestIntMax::test2 (34 bytes) 9324 98 % b 4 TestIntMax::test2 @ 5 (34 bytes) 9336 99 b 4 TestIntMax::test2 (34 bytes) Run Time: 1 077 937 282 import java.util.Random; import java.util.concurrent.ThreadLocalRandom; import java.text.DecimalFormat; import java.text.DecimalFormatSymbols; class TestIntMax { static final int RANGE = 16 * 1024; static final int ITER = 100_000; public static void main(String[] args) { final int probability = Integer.parseInt(args[0]); final DecimalFormatSymbols symbols = new DecimalFormatSymbols(); 
symbols.setGroupingSeparator(' '); final DecimalFormat format = new DecimalFormat("#,###", symbols); System.out.printf("Probability: %d%%%n", probability); int[] a = new int[64 * 1024]; init(a, probability); { System.out.println("Warmup"); for (int i = 0; i < 10_000; i++) { test1(a); } System.out.println("Run"); long t0 = System.nanoTime(); for (int i = 0; i < 10_000; i++) { test1(a); } long t1 = System.nanoTime(); System.out.println("Time: " + format.format(t1 - t0)); } { System.out.println("Warmup"); for (int i = 0; i < 10_000; i++) { test2(a); } System.out.println("Run"); long t0 = System.nanoTime(); for (int i = 0; i < 10_000; i++) { test2(a); } long t1 = System.nanoTime(); System.out.println("Time: " + format.format(t1 - t0)); } } public static int test1(int[] a) { int x = Integer.MIN_VALUE; for (int i = 0; i < a.length; i++) { x = Math.max(x, a[i]); } return x; } public static int test2(int[] a) { int x = Integer.MIN_VALUE; for (int i = 0; i < a.length; i++) { x = (x >= a[i]) ? x : a[i]; } return x; } public static void init(int[] ints, int probability) { int aboveCount, abovePercent; do { int max = ThreadLocalRandom.current().nextInt(10); ints[0] = max; aboveCount = 0; for (int i = 1; i < ints.length; i++) { int value; if (ThreadLocalRandom.current().nextInt(101) <= probability) { int increment = ThreadLocalRandom.current().nextInt(10); value = max + increment; aboveCount++; } else { // Decrement by at least 1 int decrement = ThreadLocalRandom.current().nextInt(10) + 1; value = max - decrement; } ints[i] = value; max = Math.max(max, value); } abovePercent = ((aboveCount + 1) * 100) / ints.length; } while (abovePercent != probability); } } Focusing my comment below on 100% which is where the differences appear: test2 (100%): ;; B12: # out( B21 B13 ) <- in( B11 B20 ) Freq: 1.6744e+09 0x00007f15bcada2e9: movl 0x14(%rsi, %rdx, 4), %r11d ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 14 (line 71) 0x00007f15bcada2ee: cmpl %r11d, %r10d 
0x00007f15bcada2f1: jge 0x7f15bcada362 ;*istore_1 {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 25 (line 71) test1 (100%) ;; B10: # out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 strip mined) Freq: 1.6744e+09 0x00007f15bcad9a70: movl 0x4c(%rsi, %rdx, 4), %r11d 0x00007f15bcad9a75: movl %r11d, (%rsp) 0x00007f15bcad9a79: movl 0x48(%rsi, %rdx, 4), %r10d 0x00007f15bcad9a7e: movl %r10d, 4(%rsp) 0x00007f15bcad9a83: movl 0x10(%rsi, %rdx, 4), %r11d 0x00007f15bcad9a88: movl 0x14(%rsi, %rdx, 4), %r9d 0x00007f15bcad9a8d: movl 0x44(%rsi, %rdx, 4), %r10d 0x00007f15bcad9a92: movl %r10d, 8(%rsp) 0x00007f15bcad9a97: movl 0x18(%rsi, %rdx, 4), %r8d 0x00007f15bcad9a9c: cmpl %r11d, %eax 0x00007f15bcad9a9f: cmovll %r11d, %eax 0x00007f15bcad9aa3: cmpl %r9d, %eax 0x00007f15bcad9aa6: cmovll %r9d, %eax 0x00007f15bcad9aaa: movl 0x20(%rsi, %rdx, 4), %r10d 0x00007f15bcad9aaf: cmpl %r8d, %eax 0x00007f15bcad9ab2: cmovll %r8d, %eax 0x00007f15bcad9ab6: movl 0x24(%rsi, %rdx, 4), %r8d 0x00007f15bcad9abb: movl 0x28(%rsi, %rdx, 4), %r11d ; {no_reloc} 0x00007f15bcad9ac0: movl 0x2c(%rsi, %rdx, 4), %ecx 0x00007f15bcad9ac4: movl 0x30(%rsi, %rdx, 4), %r9d 0x00007f15bcad9ac9: movl 0x34(%rsi, %rdx, 4), %edi 0x00007f15bcad9acd: movl 0x38(%rsi, %rdx, 4), %ebx 0x00007f15bcad9ad1: movl 0x3c(%rsi, %rdx, 4), %ebp 0x00007f15bcad9ad5: movl 0x40(%rsi, %rdx, 4), %r13d 0x00007f15bcad9ada: movl 0x1c(%rsi, %rdx, 4), %r14d 0x00007f15bcad9adf: cmpl %r14d, %eax 0x00007f15bcad9ae2: cmovll %r14d, %eax 0x00007f15bcad9ae6: cmpl %r10d, %eax 0x00007f15bcad9ae9: cmovll %r10d, %eax 0x00007f15bcad9aed: cmpl %r8d, %eax 0x00007f15bcad9af0: cmovll %r8d, %eax 0x00007f15bcad9af4: cmpl %r11d, %eax 0x00007f15bcad9af7: cmovll %r11d, %eax 0x00007f15bcad9afb: cmpl %ecx, %eax 0x00007f15bcad9afd: cmovll %ecx, %eax 0x00007f15bcad9b00: cmpl %r9d, %eax 0x00007f15bcad9b03: cmovll %r9d, %eax 0x00007f15bcad9b07: cmpl %edi, %eax 0x00007f15bcad9b09: cmovll %edi, %eax 0x00007f15bcad9b0c: cmpl %ebx, %eax 
0x00007f15bcad9b0e: cmovll %ebx, %eax 0x00007f15bcad9b11: cmpl %ebp, %eax 0x00007f15bcad9b13: cmovll %ebp, %eax 0x00007f15bcad9b16: cmpl %r13d, %eax 0x00007f15bcad9b19: cmovll %r13d, %eax 0x00007f15bcad9b1d: cmpl 8(%rsp), %eax 0x00007f15bcad9b21: movl 8(%rsp), %r11d 0x00007f15bcad9b26: cmovll %r11d, %eax 0x00007f15bcad9b2a: cmpl 4(%rsp), %eax 0x00007f15bcad9b2e: movl 4(%rsp), %r10d 0x00007f15bcad9b33: cmovll %r10d, %eax 0x00007f15bcad9b37: cmpl (%rsp), %eax 0x00007f15bcad9b3a: movl (%rsp), %r11d 0x00007f15bcad9b3e: cmovll %r11d, %eax ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test1 at 15 (line 61) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663633050 From galder at openjdk.org Mon Feb 17 17:05:28 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 17 Feb 2025 17:05:28 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. 
>> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... and 34 more: https://git.openjdk.org/jdk/compare/ba549afe...a190ae68 Another interesting comparison arises above when comparing `test2` in 80% vs 100%: test2 (100%): ;; B12: # out( B21 B13 ) <- in( B11 B20 ) Freq: 1.6744e+09 0x00007f15bcada2e9: movl 0x14(%rsi, %rdx, 4), %r11d ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 14 (line 71) 0x00007f15bcada2ee: cmpl %r11d, %r10d 0x00007f15bcada2f1: jge 0x7f15bcada362 ;*istore_1 {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 25 (line 71) test2(80%): ;; B10: # out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 strip mined) Freq: 1.6744e+09 0x00007fe850ada2f0: movl 0x4c(%rsi, %rdx, 4), %r11d 0x00007fe850ada2f5: movl %r11d, (%rsp) 0x00007fe850ada2f9: movl 0x48(%rsi, %rdx, 4), %r10d 0x00007fe850ada2fe: movl %r10d, 4(%rsp) 0x00007fe850ada303: movl 0x10(%rsi, %rdx, 4), %r11d 0x00007fe850ada308: movl 0x14(%rsi, %rdx, 4), %r9d 0x00007fe850ada30d: movl 0x44(%rsi, %rdx, 4), %r10d 0x00007fe850ada312: movl %r10d, 8(%rsp) 0x00007fe850ada317: movl 0x18(%rsi, %rdx, 4), %r8d 0x00007fe850ada31c: cmpl %r11d, %eax 0x00007fe850ada31f: cmovll %r11d, %eax 0x00007fe850ada323: cmpl %r9d, %eax 0x00007fe850ada326: cmovll %r9d, %eax 0x00007fe850ada32a: movl 0x20(%rsi, %rdx, 4), %r10d 0x00007fe850ada32f: cmpl %r8d, %eax 0x00007fe850ada332: cmovll %r8d, %eax 0x00007fe850ada336: movl 
0x24(%rsi, %rdx, 4), %r8d 0x00007fe850ada33b: movl 0x28(%rsi, %rdx, 4), %r11d ; {no_reloc} 0x00007fe850ada340: movl 0x2c(%rsi, %rdx, 4), %ecx 0x00007fe850ada344: movl 0x30(%rsi, %rdx, 4), %r9d 0x00007fe850ada349: movl 0x34(%rsi, %rdx, 4), %edi 0x00007fe850ada34d: movl 0x38(%rsi, %rdx, 4), %ebx 0x00007fe850ada351: movl 0x3c(%rsi, %rdx, 4), %ebp 0x00007fe850ada355: movl 0x40(%rsi, %rdx, 4), %r13d 0x00007fe850ada35a: movl 0x1c(%rsi, %rdx, 4), %r14d 0x00007fe850ada35f: cmpl %r14d, %eax 0x00007fe850ada362: cmovll %r14d, %eax 0x00007fe850ada366: cmpl %r10d, %eax 0x00007fe850ada369: cmovll %r10d, %eax 0x00007fe850ada36d: cmpl %r8d, %eax 0x00007fe850ada370: cmovll %r8d, %eax 0x00007fe850ada374: cmpl %r11d, %eax 0x00007fe850ada377: cmovll %r11d, %eax 0x00007fe850ada37b: cmpl %ecx, %eax 0x00007fe850ada37d: cmovll %ecx, %eax 0x00007fe850ada380: cmpl %r9d, %eax 0x00007fe850ada383: cmovll %r9d, %eax 0x00007fe850ada387: cmpl %edi, %eax 0x00007fe850ada389: cmovll %edi, %eax 0x00007fe850ada38c: cmpl %ebx, %eax 0x00007fe850ada38e: cmovll %ebx, %eax 0x00007fe850ada391: cmpl %ebp, %eax 0x00007fe850ada393: cmovll %ebp, %eax 0x00007fe850ada396: cmpl %r13d, %eax 0x00007fe850ada399: cmovll %r13d, %eax 0x00007fe850ada39d: cmpl 8(%rsp), %eax 0x00007fe850ada3a1: movl 8(%rsp), %r11d 0x00007fe850ada3a6: cmovll %r11d, %eax 0x00007fe850ada3aa: cmpl 4(%rsp), %eax 0x00007fe850ada3ae: movl 4(%rsp), %r10d 0x00007fe850ada3b3: cmovll %r10d, %eax 0x00007fe850ada3b7: cmpl (%rsp), %eax 0x00007fe850ada3ba: movl (%rsp), %r11d 0x00007fe850ada3be: cmovll %r11d, %eax ;*istore_1 {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 25 (line 71) There are a couple of things is puzzling me. This test is like a reduction test and no vectorization appears to be kicking in any of the percentages (I've not enabled vectorization SW rejections to check). The other thing that is strange is the overall time. 
When no vectorization kicks in and the code uses cmovs, I've been seeing worse performance numbers compared to, say, compare and jumps, particularly in the 100% tests. With `TestIntMax` it appears to be the opposite: test2 at 100% uses jmp+cmp, which performs worse than the cmov versions.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663665858

From dnsimon at openjdk.org  Mon Feb 17 17:11:22 2025
From: dnsimon at openjdk.org (Doug Simon)
Date: Mon, 17 Feb 2025 17:11:22 GMT
Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2]
In-Reply-To: 
References: 
Message-ID: 

On Mon, 17 Feb 2025 16:12:41 GMT, Doug Simon wrote:

>> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal.
>>
>> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI is still experimental and only has qualified exports to Graal, I don't think this needs a CSR.

> Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 > - remove non-native-image build time use of ServiceLoader > - make Cleaner.clean public Passes openjdk-pr-canary: https://github.com/dougxc/openjdk-pr-canary/actions/runs/13374826011/job/37351770830#step:4:47 ------------- PR Comment: https://git.openjdk.org/jdk/pull/22869#issuecomment-2663687923 From galder at openjdk.org Mon Feb 17 17:21:15 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 17 Feb 2025 17:21:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Mon, 17 Feb 2025 17:02:47 GMT, Galder Zamarreño wrote: > This test is like a reduction test and no vectorization appears to be kicking in at any of the percentages (I've not enabled vectorization SW rejections to check). Ah, that's probably because of profitable vectorization checks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663710153 From dnsimon at openjdk.org Mon Feb 17 17:43:14 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 17:43:14 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: On Mon, 23 Dec 2024 18:06:21 GMT, Doug Simon wrote: >> Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: >> >> - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 >> - remove non-native-image build time use of ServiceLoader >> - make Cleaner.clean public > > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/services/Services.java line 52: > >> 50: * statement on this field - the guard cannot be behind a method call. >> 51: */ >> 52: public static final boolean IS_BUILDING_NATIVE_IMAGE = Boolean.parseBoolean(VM.getSavedProperty("jdk.vm.ci.services.aot")); > > This field is no longer used in JVMCI and I will remove its usages in Graal. Removed in https://github.com/oracle/graal/pull/10380 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22869#discussion_r1958608248 From yzheng at openjdk.org Mon Feb 17 17:56:15 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Mon, 17 Feb 2025 17:56:15 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: <98jmUmCaXEstTsMZUeuKA1QBro7kZvIZhrFsQWbQIj0=.f4e81caf-78b4-44b8-9d70-b1d68cfc6f7b@github.com> On Mon, 17 Feb 2025 16:12:41 GMT, Doug Simon wrote: >> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. >> >> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI only has qualified exports to Graal, a CSR is not needed.
> > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 > - remove non-native-image build time use of ServiceLoader > - make Cleaner.clean public LGTM ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/22869#pullrequestreview-2621722417 From kvn at openjdk.org Mon Feb 17 18:43:23 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 18:43:23 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 17 Feb 2025 06:24:35 GMT, Axel Boldt-Christmas wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove commented lines left by mistake > > src/hotspot/share/code/codeBlob.hpp line 308: > >> 306: >> 307: class Vptr : public CodeBlob::Vptr { >> 308: }; > > Was this needed for some compiler? Or is it to be more explicit about the type hierarchy? Thank you, @xmas92, for review and suggestions. It is the second (explicit type hierarchy). I think it should be explicitly declared (even empty) because it is referenced in subclasses to avoid confusion. And it could be useful in the future if we need other virtual methods. Local build with `gcc` on Linux passed without it but I did not try to build on other platforms.
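[Editor's note: for readers following along, the explicit-Vptr pattern under discussion can be sketched in a few lines of standalone C++. This is an illustrative sketch with invented names, not the HotSpot sources: the blob class itself stays non-polymorphic, a plain kind tag selects an explicit `Vptr` object, and even an empty subclass `Vptr` spells out the hierarchy.]

```cpp
#include <cassert>
#include <type_traits>

// Illustrative sketch of de-virtualized dispatch (invented names, not
// HotSpot code): Blob has no virtual methods of its own, so its object
// layout contains no compiler-generated vptr that would need patching
// when the object is saved to and restored from an AOT cache image.
struct Blob {
  enum Kind { kBlob, kNmethod };

  // Explicit, hand-rolled "vtable". Subclasses extend it; declaring an
  // empty subclass Vptr keeps the hierarchy explicit even before it
  // adds anything.
  struct Vptr {
    virtual ~Vptr() = default;
    virtual int header_size(const Blob*) const { return 16; }
  };

  explicit Blob(Kind k) : _kind(k) {}
  int header_size() const;  // routed through vptr_of(_kind)

  Kind _kind;
};

struct Nmethod : Blob {
  struct Vptr : Blob::Vptr {
    int header_size(const Blob*) const override { return 64; }
  };
  Nmethod() : Blob(kNmethod) {}
};

// Mirrors the PR's static asserts guarding against someone
// reintroducing virtual methods on the blob classes themselves.
static_assert(!std::is_polymorphic<Blob>::value, "Blob must stay non-virtual");
static_assert(!std::is_polymorphic<Nmethod>::value, "Nmethod must stay non-virtual");

static Blob::Vptr blob_vptr;
static Nmethod::Vptr nmethod_vptr;

static const Blob::Vptr* vptr_of(Blob::Kind k) {
  return k == Blob::kNmethod ? &nmethod_vptr : &blob_vptr;
}

int Blob::header_size() const { return vptr_of(_kind)->header_size(this); }
```

Dispatch cost stays comparable to a virtual call, but the indirection now goes through ordinary static data selected by `_kind`, so a cached blob image needs no hidden-pointer fix-up.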
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1958673128 From dnsimon at openjdk.org Mon Feb 17 19:37:24 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 19:37:24 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: <0elzblvKiIjGRnZiBSPjStJpDMTPJyXObkHwVuStSJg=.8ac2fd8e-d38c-42de-a1fa-c94eac144a73@github.com> On Mon, 17 Feb 2025 16:12:41 GMT, Doug Simon wrote: >> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. >> >> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI only has qualified exports to Graal, a CSR is not needed. > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 > - remove non-native-image build time use of ServiceLoader > - make Cleaner.clean public Thanks for the reviews.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22869#issuecomment-2663944221 From dnsimon at openjdk.org Mon Feb 17 19:37:24 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 19:37:24 GMT Subject: Integrated: 8346781: [JVMCI] Limit ServiceLoader to class initializers In-Reply-To: References: Message-ID: On Mon, 23 Dec 2024 17:58:23 GMT, Doug Simon wrote: > In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. > > This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI only has qualified exports to Graal, a CSR is not needed. This pull request has now been integrated.
Changeset: 8ec58939 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/8ec589390f7dc67dd883a1efddb8da32790f6591 Stats: 166 lines in 7 files changed: 10 ins; 126 del; 30 mod 8346781: [JVMCI] Limit ServiceLoader to class initializers Reviewed-by: never, yzheng ------------- PR: https://git.openjdk.org/jdk/pull/22869 From jwaters at openjdk.org Tue Feb 18 02:39:25 2025 From: jwaters at openjdk.org (Julian Waters) Date: Tue, 18 Feb 2025 02:39:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 06:32:56 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. 
New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions Is anyone else getting compile failures after this was integrated? This weirdly seems to only happen on Linux * For target hotspot_variant-server_libjvm_objs_mulnode.o: /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp: In member function ‘virtual const Type* FmaHFNode::Value(PhaseGVN*) const’: /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:1944:37: error: call of overloaded ‘make(double)’ is ambiguous 1944 | return TypeH::make(fma(f1, f2, f3)); | ^ In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/node.hpp:31, from /home/runner/work/jdk/jdk/src/hotspot/share/opto/addnode.hpp:28, from /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:26: /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:544:23: note: candidate: ‘static const TypeH* TypeH::make(float)’ 544 | static const TypeH* make(float f); | ^~~~ /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:545:23: note: candidate: ‘static const TypeH* TypeH::make(short int)’
545 | static const TypeH* make(short f); | ^~~~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2664473623 From cjplummer at openjdk.org Tue Feb 18 03:05:16 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Tue, 18 Feb 2025 03:05:16 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sat, 15 Feb 2025 06:34:56 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in the future. >> >> Fixed/cleaned SA code which processes CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove commented lines left by mistake SA changes look good. Thanks for taking care of this. ------------- Marked as reviewed by cjplummer (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2622331256 From galder at openjdk.org Tue Feb 18 08:04:21 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 18 Feb 2025 08:04:21 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... and 34 more: https://git.openjdk.org/jdk/compare/6ad0c61a...a190ae68 What is happening with int min/max needs a separate investigation because based on my testing, the int min/max intrinsic is both a regression and a performance improvement! Check this out: make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionSimpleMax" MICRO="FORK=1" Benchmark (probability) (size) Mode Cnt Score Error Units MinMaxVector.intReductionSimpleMax 50 2048 thrpt 4 460.585 ± 0.348 ops/ms MinMaxVector.intReductionSimpleMax 80 2048 thrpt 4 460.633 ± 0.103 ops/ms MinMaxVector.intReductionSimpleMax 100 2048 thrpt 4 460.580 ±
0.091 ops/ms make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionSimpleMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMax_jmhTest::intReductionSimpleMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_max" Benchmark (probability) (size) Mode Cnt Score Error Units MinMaxVector.intReductionSimpleMax 50 2048 thrpt 4 460.479 ± 0.044 ops/ms MinMaxVector.intReductionSimpleMax 80 2048 thrpt 4 460.587 ± 0.106 ops/ms MinMaxVector.intReductionSimpleMax 100 2048 thrpt 4 1027.831 ± 9.353 ops/ms 80%: ?? ? 0x00007ffb200fa089: cmpl %r11d, %r10d 3.04% ?? ? 0x00007ffb200fa08c: cmovll %r11d, %r10d 4.38% ?? ? 0x00007ffb200fa090: cmpl %ebx, %r10d 1.61% ?? ? 0x00007ffb200fa093: cmovll %ebx, %r10d 2.79% ?? ? 0x00007ffb200fa097: cmpl %edi, %r10d 2.92% ?? ? 0x00007ffb200fa09a: cmovll %edi, %r10d ;*ireturn {reexecute=0 rethrow=0 return_oop=0} ?? ? ; - java.lang.Math::max at 10 (line 2023) ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMax at 23 (line 232) 100%: 3.11% ??????? ?????? ? 0x00007f26c00f8f9c: nopl (%rax) 3.31% ??????? ?????? ? 0x00007f26c00f8fa0: cmpl %r10d, %ecx ???????? ?????? ? 0x00007f26c00f8fa3: jge 0x7f26c00f8ff1 ;*ireturn {reexecute=0 rethrow=0 return_oop=0} ???????? ?????? ? ; - java.lang.Math::max at 10 (line 2023) ???????? ?????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMax at 23 (line 232) ???????? ?????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMax_jmhTest::intReductionSimpleMax_thrpt_jmhStub at 19 (line 124) make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMax" MICRO="FORK=1" Benchmark (probability) (size) Mode Cnt Score Error Units MinMaxVector.intReductionMultiplyMax 50 2048 thrpt 4 2815.614 ± 0.406 ops/ms MinMaxVector.intReductionMultiplyMax 80 2048 thrpt 4 2814.943 ±
2.174 ops/ms MinMaxVector.intReductionMultiplyMax 100 2048 thrpt 4 2815.285 ± 1.725 ops/ms make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_max" Benchmark (probability) (size) Mode Cnt Score Error Units MinMaxVector.intReductionMultiplyMax 50 2048 thrpt 4 2802.062 ± 0.710 ops/ms MinMaxVector.intReductionMultiplyMax 80 2048 thrpt 4 2814.874 ± 4.058 ops/ms MinMaxVector.intReductionMultiplyMax 100 2048 thrpt 4 883.879 ± 0.327 ops/ms 80%: 3.54% ? ?? ????? 0x00007faa700fa177: vpmaxsd %ymm4, %ymm5, %ymm13;*ireturn {reexecute=0 rethrow=0 return_oop=0} ? ?? ????? ; - java.lang.Math::max at 10 (line 2023) 100: 7.50% ??????????????????? ? 0x00007f75280f8849: imull $0xb, 0x2c(%rbp, %r11, 4), %r10d ??????????????????? ? ;*imul {reexecute=0 rethrow=0 return_oop=0} ??????????????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMax at 20 (line 221) ??????????????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub at 19 (line 124) 3.85% ??????????????????? ? 0x00007f75280f884f: cmpl %r10d, %r8d ??????????????????? ? 0x00007f75280f8852: jl 0x7f75280f87d0 ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0} ?????????? ???????? ? ; - java.lang.Math::max at 2 (line 2023) ?????????? ???????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMax at 26 (line 222) ?????????? ???????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub at 19 (line 124) I ran the exact same test with longs and I don't see such an issue. The performance is always the same either with the intrinsic or disabling it as shown above.
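[Editor's note: one way to see why the 100% case behaves so differently from the 50/80% cases is the branch-frequency argument raised elsewhere in this thread: for data in random order the "new maximum found" branch of a reduction is taken only about H(n) = 1 + 1/2 + ... + 1/n times in n iterations, so it is almost never taken and trivially predictable, while ascending data takes it every time. The following standalone C++ sketch is illustrative only and unrelated to the JMH benchmark code.]

```cpp
#include <cassert>
#include <cstdlib>

// Counts how often the "update the running maximum" branch is taken.
// For random data the count grows only like the harmonic number H(n),
// so the branch is almost always not-taken; for ascending data it is
// taken on every iteration (the 100% case in the benchmarks above).
static int count_max_updates(const int* a, int n) {
  int m = a[0];
  int updates = 0;
  for (int i = 1; i < n; i++) {
    if (a[i] > m) {  // branchy max; a cmov would instead always pay the compare latency
      m = a[i];
      updates++;
    }
  }
  return updates;
}
```

This is why a single compile-time choice between cmov and branch can be a win on one input distribution and a regression on another.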
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664871838 From galder at openjdk.org Tue Feb 18 08:16:19 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 18 Feb 2025 08:16:19 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin wrote: >> @rwestrel @galderz >> >>> It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues. >> >> I'm a little scared to just accept the regressions, especially for this "most average looking case": >> Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `(sum_i 1/i) / n`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in. >> >>> The Min/Max nodes are floating nodes.
>> >> I suppose we could write an optimization that can hoist out loop independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry. >> >>> Shouldn't int min/max be affected the same way? >> >> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below: >> >> >> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java >> CompileCommand: compileonly TestIntMax.test* bool compileonly = true >> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true >> Warmup >> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes) >> 5226 94 3 TestIntMax::test1 (27 bytes) >> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes) >> 5238 96 4 TestIntMax::test1 (27 bytes) >> Run >> Time: 542056319 >> Warmup >> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes) >> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes) >> 6329 103 4 TestIntMax::test2 (34 bytes) >> Run >> Time: 166815209 >> >> That's a 4x regression on random input data! >> >> With: >> >> import java.util.Random; >> >> public class TestIntMax { >> private static Random RANDOM = new Random(); >> >> public static void main(String[] args) { >> int[] a = new int[64 * 1024]; >> for (int i = 0; i < a.length; i++) { >>... > >> I think we should be able to see the same issue here, actually. Yes. 
Here a quick benchmark below: > > I observe the same: > > > Warmup > 751 3 b TestIntMax::test1 (27 bytes) > Run > Time: 360 550 158 > Warmup > 1862 15 b TestIntMax::test2 (34 bytes) > Run > Time: 92 116 170 > > > But then with this: > > > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index 8cc4a970bfd..9abda8f4178 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr) > %} > > > -instruct maxI_rReg(rRegI dst, rRegI src) > +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) > %{ > match(Set dst (MaxI dst src)); > + effect(KILL cr); > > ins_cost(200); > - expand %{ > - rFlagsReg cr; > - compI_rReg(cr, dst, src); > - cmovI_reg_l(dst, src, cr); > + ins_encode %{ > + Label done; > + __ cmpl($src$$Register, $dst$$Register); > + __ jccb(Assembler::less, done); > + __ mov($dst$$Register, $src$$Register); > + __ bind(done); > %} > + ins_pipe(pipe_cmov_reg); > %} > > // ============================================================================ > > > the performance gap narrows: > > > Warmup > 770 3 b TestIntMax::test1 (27 bytes) > Run > Time: 94 951 677 > Warmup > 1312 15 b TestIntMax::test2 (34 bytes) > Run > Time: 70 053 824 > > > (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then? Note something I spoke with @rwestrel yesterday in the context of long min/max vs int min/max. Int has an ad implementation for min/max whereas long does not. My very first prototype of this issue was to mimic what int did with long, but talking to @rwestrel we decided it would be better to implement this without introducing platform specific changes. So, following Roland's thread in https://github.com/openjdk/jdk/pull/20098#issuecomment-2663379660, I could add ad changes for say x86 and aarch64 for long such that it uses branch instead of cmov.
Note that the cmov fallback of long min/max comes from macro expansion, not platform specific changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664893516 From galder at openjdk.org Tue Feb 18 08:20:15 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 18 Feb 2025 08:20:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin wrote: >> @rwestrel @galderz >> >>> It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues. >> >> I'm a little scared to just accept the regressions, especially for this "most average looking case": >> Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `(sum_i 1/i) / n`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in. >> >>> The Min/Max nodes are floating nodes.
They can hoist out of loop and common reliably in ways that are not guaranteed otherwise. >> >> I suppose we could write an optimization that can hoist out loop independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry. >> >>> Shouldn't int min/max be affected the same way? >> >> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below: >> >> >> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java >> CompileCommand: compileonly TestIntMax.test* bool compileonly = true >> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true >> Warmup >> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes) >> 5226 94 3 TestIntMax::test1 (27 bytes) >> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes) >> 5238 96 4 TestIntMax::test1 (27 bytes) >> Run >> Time: 542056319 >> Warmup >> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes) >> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes) >> 6329 103 4 TestIntMax::test2 (34 bytes) >> Run >> Time: 166815209 >> >> That's a 4x regression on random input data! >> >> With: >> >> import java.util.Random; >> >> public class TestIntMax { >> private static Random RANDOM = new Random(); >> >> public static void main(String[] args) { >> int[] a = new int[64 * 1024]; >> for (int i = 0; i < a.length; i++) { >>... > >> I think we should be able to see the same issue here, actually. Yes. 
Here a quick benchmark below: > > I observe the same: > > > Warmup > 751 3 b TestIntMax::test1 (27 bytes) > Run > Time: 360 550 158 > Warmup > 1862 15 b TestIntMax::test2 (34 bytes) > Run > Time: 92 116 170 > > > But then with this: > > > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index 8cc4a970bfd..9abda8f4178 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr) > %} > > > -instruct maxI_rReg(rRegI dst, rRegI src) > +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) > %{ > match(Set dst (MaxI dst src)); > + effect(KILL cr); > > ins_cost(200); > - expand %{ > - rFlagsReg cr; > - compI_rReg(cr, dst, src); > - cmovI_reg_l(dst, src, cr); > + ins_encode %{ > + Label done; > + __ cmpl($src$$Register, $dst$$Register); > + __ jccb(Assembler::less, done); > + __ mov($dst$$Register, $src$$Register); > + __ bind(done); > %} > + ins_pipe(pipe_cmov_reg); > %} > > // ============================================================================ > > > the performance gap narrows: > > > Warmup > 770 3 b TestIntMax::test1 (27 bytes) > Run > Time: 94 951 677 > Warmup > 1312 15 b TestIntMax::test2 (34 bytes) > Run > Time: 70 053 824 > > > (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then? To make it more explicit: implementing long min/max in ad files as cmp will likely remove all the 100% regressions that are observed here. I'm going to repeat the same MinMaxVector int min/max reduction test above with the ad changes @rwestrel suggested to see what effect they have. 
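[Editor's note: the two scalar shapes being compared here can be written out as plain C++. This is an illustrative rendition of what the encodings compute, not the .ad code itself: one is the cmov-style data-dependent form, the other is the compare-and-short-branch form from the `maxI_rReg` patch quoted above.]

```cpp
#include <cassert>

// cmov-style lowering: no control flow, so nothing to mispredict, but
// the result always waits on the compare (cmpl + cmovll).
static int maxI_cmov(int dst, int src) {
  return dst < src ? src : dst;  // compilers commonly emit a cmov for this shape
}

// branch-style lowering, mirroring the patched maxI_rReg encoding:
// cmpl src, dst; jccb less, done; mov dst, src; done:
static int maxI_branch(int dst, int src) {
  if (!(src < dst)) {  // jccb(Assembler::less, done) skips the move when src < dst
    dst = src;         // mov dst, src
  }
  return dst;
}
```

Which form wins depends on predictability: the branch form is cheap when the taken probability is extreme (almost always or almost never), while the cmov form is immune to the mispredictions that dominate the unpredictable middle range.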
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664903731 From epeter at openjdk.org Tue Feb 18 08:46:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 08:46:17 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Tue, 18 Feb 2025 08:17:59 GMT, Galder Zamarreño wrote: >>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below: >> >> I observe the same: >> >> >> Warmup >> 751 3 b TestIntMax::test1 (27 bytes) >> Run >> Time: 360 550 158 >> Warmup >> 1862 15 b TestIntMax::test2 (34 bytes) >> Run >> Time: 92 116 170 >> >> >> But then with this: >> >> >> diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad >> index 8cc4a970bfd..9abda8f4178 100644 >> --- a/src/hotspot/cpu/x86/x86_64.ad >> +++ b/src/hotspot/cpu/x86/x86_64.ad >> @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr) >> %} >> >> >> -instruct maxI_rReg(rRegI dst, rRegI src) >> +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) >> %{ >> match(Set dst (MaxI dst src)); >> + effect(KILL cr); >> >> ins_cost(200); >> - expand %{ >> - rFlagsReg cr; >> - compI_rReg(cr, dst, src); >> - cmovI_reg_l(dst, src, cr); >> + ins_encode %{ >> + Label done; >> + __ cmpl($src$$Register, $dst$$Register); >> + __ jccb(Assembler::less, done); >> + __ mov($dst$$Register, $src$$Register); >> + __ bind(done); >> %} >> + ins_pipe(pipe_cmov_reg); >> %} >> >> //
============================================================================ >> >> >> the performance gap narrows: >> >> >> Warmup >> 770 3 b TestIntMax::test1 (27 bytes) >> Run >> Time: 94 951 677 >> Warmup >> 1312 15 b TestIntMax::test2 (34 bytes) >> Run >> Time: 70 053 824 >> >> >> (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then? > > To make it more explicit: implementing long min/max in ad files as cmp will likely remove all the 100% regressions that are observed here. I'm going to repeat the same MinMaxVector int min/max reduction test above with the ad changes @rwestrel suggested to see what effect they have. @galderz I think we will have the same issue with both `int` and `long`: As far as I know, it is really a difficult problem to decide at compile-time if a `cmove` or `branch` is the better choice. I'm not sure there is any heuristic for which you will not find a micro-benchmark where the heuristic made the wrong choice. To my understanding, these are the factors that impact the performance: - `cmove` requires all inputs to complete before it can execute, and it has an inherent latency of a cycle or so itself. But you cannot have any branch mispredictions, and hence no branch misprediction penalties (i.e. when the CPU has to flush out the ops from the wrong branch and restart at the branch). - `branch` can hide some latencies, because we can already continue with the branch that is speculated on. We do not need to wait for the inputs of the comparison to arrive, and we can already continue with the speculated resulting value. But if the speculation is ever wrong, we have to pay the misprediction penalty. In my understanding, there are roughly 3 scenarios: - The branch probability is so extreme that the branch predictor would be correct almost always, and so it is profitable to do branching code. - The branching probability is somewhere in the middle, and the branch is not predictable. 
Branch mispredictions are very expensive, and so it is better to use `cmove`. - The branching probability is somewhere in the middle, but the branch is predictable (e.g. swapps back and forth). The branch predictor will have almost no mispredictions, and it is faster to use branching code. Modeling this precisely is actually a little complex. You would have to know the cost of the `cmove` and the `branching` version of the code. That depends on the latency of the inputs, and the outputs: does the `cmove` dramatically increase the latency on the critical path, and `branching` could hide some of that latency? And you would have to know how good the branch predictor is, which you cannot derive from the branching probability of our profiling (at least not when the probabilities are in the middle, and you don't know if it is a random or predictable pattern). If we can find a perfect heuristic - that would be fantastic ;) If we cannot find a perfect heuristic, then we should think about what are the most "common" or "relevant" scenarios, I think. But let's discuss all of this in a call / offline :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664956307 From galder at openjdk.org Tue Feb 18 09:24:18 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 18 Feb 2025 09:24:18 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FF eVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Tue, 18 Feb 2025 08:43:38 GMT, Emanuel Peter wrote: > But let's discuss all of this in a call / offline :) Yup. 
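The cmove-versus-branch trade-off discussed above can be made concrete with a small illustrative Java sketch (this is not code from the patch): a branchy max and a branch-free max that compute the same result. Whether C2 actually emits a jump or a cmov for either shape depends on profiling and the matcher, so this only models the two code shapes being compared.

```java
class MaxSketch {
    // Branchy form: the ternary is profiled and may become a jump or a cmov.
    static int maxBranchy(int a, int b) {
        return a > b ? a : b;
    }

    // Branch-free form: always executes the same instructions; like a cmov
    // it has a fixed latency and no misprediction penalty.
    static int maxBranchless(int a, int b) {
        long d = (long) a - b;                // widen so a - b cannot overflow
        return (int) (b + (d & ~(d >> 63)));  // add d only when d > 0, else add 0
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 7, -7, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int a : samples) {
            for (int b : samples) {
                if (maxBranchy(a, b) != Math.max(a, b)) throw new AssertionError();
                if (maxBranchless(a, b) != Math.max(a, b)) throw new AssertionError();
            }
        }
        System.out.println("ok");
    }
}
```

On predictable data both forms tend to run well; on random data near 50% probability the branchy form pays misprediction penalties while the branch-free form does not, which is the heuristic problem described above.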
> I ran the exact same test with longs and I don't see such an issue. The performance is always the same either with the intrinsic or disabling it as shown above.

For the equivalent long tests I think I made a mistake in the id of the disabled intrinsic, it should be `_maxL` and not `_max`. I will repeat the tests and post if any similar differences are observed.

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2665045881

From roland at openjdk.org Tue Feb 18 09:35:16 2025
From: roland at openjdk.org (Roland Westrelin)
Date: Tue, 18 Feb 2025 09:35:16 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: 
References: 
Message-ID: 

On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote:

> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>
> **Background**
>
> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer.
>
> **Problem**
>
> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code.
>
>
> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
> MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
> test3(nativeUnaligned);
>
>
> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not!
>
>
> static void test3(MemorySegment ms) {
>     for (int i = 0; i < RANGE; i++) {
>         long adr = i * 4L;
>         int v = ms.get(ELEMENT_LAYOUT, adr);
>         ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
>     }
> }
>
>
> **Solution: Runtime Checks - Predicate and Multiversioning**
>
> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check.
>
> I came up with 2 options where to place the runtime checks:
> - A new "auto vectorization" Parse Predicate:
>   - This only works when predicates are available.
>   - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
> - Multiversion the loop:
>   - Create 2 copies of the loop (fast and slow loops).
>   - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take
>   - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code.
>   - We "stall" the `...

Would it make sense to add verification code that makes sure that whenever a loop is flagged as multi version, c2 can find the multi version guard (and maybe whenever there's a multi version guard, loops that are guarded are indeed flagged correctly)?

src/hotspot/share/opto/loopTransform.cpp line 751:

> 749: // Peeling also destroys the connection of the main loop
> 750: // to the multiversion_if.
> 751: cl->set_no_multiversion();

Would we want to change the multiversion guard at this point so it constant folds and the slow version is removed?
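The multiversioning shape described in the PR text above can be modeled at the source level roughly like this (a toy sketch with invented names - the real `multiversion_if` and the two loop copies are generated inside C2, never written by the user):

```java
class MultiversionModel {
    static final int VECTOR_ALIGNMENT = 16; // assumed alignment in bytes, invented for the sketch

    // Fast copy: only entered when the speculative alignment check holds,
    // so a vectorizer could assume an aligned base here.
    static long sumFast(int[] data) {
        long s = 0;
        for (int v : data) s += v;
        return s;
    }

    // Slow copy: no assumptions, still compiled, just not vectorized.
    static long sumSlow(int[] data) {
        long s = 0;
        for (int v : data) s += v;
        return s;
    }

    // Models the multiversion_if: a runtime check on the base address
    // decides which copy of the loop runs.
    static long sum(long baseAddress, int[] data) {
        return (baseAddress % VECTOR_ALIGNMENT == 0)
                ? sumFast(data)
                : sumSlow(data);
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sum(64, data)); // aligned base  -> fast copy, prints 10
        System.out.println(sum(65, data)); // unaligned base -> slow copy, prints 10
    }
}
```

Both copies compute the same result; the point of the shape is that the unaligned case still runs reasonably fast compiled code instead of deoptimizing.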
src/hotspot/share/opto/loopUnswitch.cpp line 513: > 511: > 512: // Create new Region. > 513: RegionNode* region = new RegionNode(1); So we create a new `Region` every time a new condition is added? src/hotspot/share/opto/loopnode.cpp line 1097: > 1095: // PhaseIdealLoop::add_parse_predicate only checks trap limits per method, so > 1096: // we do a custom check here. > 1097: if (!C->too_many_traps(cloned_sfpt->jvms()->method(), cloned_sfpt->jvms()->bci(), Deoptimization::Reason_auto_vectorization_check)) { Isn't that done by `add_parse_predicate`? src/hotspot/share/opto/traceAutoVectorizationTag.hpp line 32: > 30: > 31: #define COMPILER_TRACE_AUTO_VECTORIZATION_TAG(flags) \ > 32: flags(POINTER_PARSING, "Trace VPointer/MemPointer parsing") \ Has anything changed here? I stared at it a few times and couldn't figure out what has. ------------- PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2622881581 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959338954 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959344256 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959347164 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959349092 From roland at openjdk.org Tue Feb 18 09:48:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 09:48:14 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 15:24:44 GMT, Emanuel Peter wrote: > > Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to to use the same reason for aliasing and alignment checks) > > I suppose that is currently what I'm planning. But we could in principle separate them. But I would leave that for later, if there is any desire to do that. 
For now, I think it's ok to just go with a single "auto-vectorization" reason. > > Does that sound reasonable? Yes, it sounds reasonable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2665104472 From epeter at openjdk.org Tue Feb 18 09:48:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 09:48:16 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: <47tXBG3sQGZVEE5Ya2wr46CopmDjy8OClbpqagIsjgA=.6d07b495-4777-4c7e-a3b7-820f100ec2c0@github.com> On Tue, 18 Feb 2025 09:09:15 GMT, Roland Westrelin wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
>> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > src/hotspot/share/opto/loopTransform.cpp line 751: > >> 749: // Peeling also destroys the connection of the main loop >> 750: // to the multiversion_if. >> 751: cl->set_no_multiversion(); > > Would we want to change the multiversion guard at this point so it constant folds and the slow version is removed? I suppose we can probably do that. Otherwise, we just have to wait until the `OpaqueMultiversioningNode` constant folds after loop-opts. > src/hotspot/share/opto/loopUnswitch.cpp line 513: > >> 511: >> 512: // Create new Region. >> 513: RegionNode* region = new RegionNode(1); > > So we create a new `Region` every time a new condition is added? Yes. Are you ok with that? 
Or would you prefer if we extended an existing region (is that possible?) and then we'd have 2 cases, one where there is none yet, and one where we'd extend. I think adding one each time is easier, and it would get commoned anyway, right? > src/hotspot/share/opto/traceAutoVectorizationTag.hpp line 32: > >> 30: >> 31: #define COMPILER_TRACE_AUTO_VECTORIZATION_TAG(flags) \ >> 32: flags(POINTER_PARSING, "Trace VPointer/MemPointer parsing") \ > > Has anything changed here? I stared at it a few times and couldn't figure out what has. I added the tag `SPECULATIVE_RUNTIME_CHECKS`. And then had to change alignment for all others ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959397988 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959392450 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959394676 From epeter at openjdk.org Tue Feb 18 09:51:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 09:51:15 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:14:28 GMT, Roland Westrelin wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. 
I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > src/hotspot/share/opto/loopnode.cpp line 1097: > >> 1095: // PhaseIdealLoop::add_parse_predicate only checks trap limits per method, so >> 1096: // we do a custom check here. 
>> 1097: if (!C->too_many_traps(cloned_sfpt->jvms()->method(), cloned_sfpt->jvms()->bci(), Deoptimization::Reason_auto_vectorization_check)) {
>
> Isn't that done by `add_parse_predicate`?

@rwestrel I only see `if (!C->too_many_traps(reason)) {` in `PhaseIdealLoop::add_parse_predicate`. And as the comment I put here says, that only checks the `reason` per `method`, and not per `bci`. Do you see anything else?

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959403871

From epeter at openjdk.org Tue Feb 18 09:56:13 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 18 Feb 2025 09:56:13 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: 
References: 
Message-ID: 

On Tue, 18 Feb 2025 09:32:19 GMT, Roland Westrelin wrote:

> Would it make sense to add verification code that makes sure that whenever a loop is flagged as multi version, c2 can find the multi version guard (and maybe whenever there's a multi version guard, loops that are guarded are indeed flagged correctly)?

I'd have to see if that is possible. Well:

> verification code that makes sure that whenever a loop is flagged as multi version, c2 can find the multi version guard

That is maybe possible. At least I cannot think of a reason why it should not work right now. Well, maybe: what if the predicates get messed up somehow, is that possible? Then you would lose the connection. Ah: what if the pre-loop somehow gets "messed up", i.e. it loses its loop structure? Then we could not really go from the main-loop to the pre-loop to the selector-if any more.

> whenever there's a multi version guard, loops that are guarded are indeed flagged correctly

That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that?
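The per-method versus per-bci trap accounting raised in the review comment above can be modeled with a toy counter (this is not HotSpot code; the class, method names and the limit are invented for the sketch): a per-method limit tolerates several traps for a reason anywhere in the method, while the conservative per-bci rule refuses as soon as a single trap was recorded at that bytecode index.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (not HotSpot code) contrasting the two checks being discussed.
class TrapModel {
    static final int PER_METHOD_TRAP_LIMIT = 100; // invented value

    private final Map<String, Integer> perMethodCount = new HashMap<>();
    private final Set<String> trappedAtBci = new HashSet<>();

    void recordTrap(String method, int bci, String reason) {
        perMethodCount.merge(method + "/" + reason, 1, Integer::sum);
        trappedAtBci.add(method + "/" + bci + "/" + reason);
    }

    // Method-wide budget: many traps are allowed before giving up.
    boolean tooManyTrapsPerMethod(String method, String reason) {
        return perMethodCount.getOrDefault(method + "/" + reason, 0)
                >= PER_METHOD_TRAP_LIMIT;
    }

    // Conservative per-bci heuristic: one recorded trap is already enough.
    boolean tooManyTrapsAtBci(String method, int bci, String reason) {
        return trappedAtBci.contains(method + "/" + bci + "/" + reason);
    }

    public static void main(String[] args) {
        TrapModel model = new TrapModel();
        model.recordTrap("test3", 42, "auto_vectorization_check");
        // After one trap at bci 42 the per-bci check already refuses there,
        // while the per-method budget is far from exhausted.
        System.out.println(model.tooManyTrapsAtBci("test3", 42, "auto_vectorization_check"));
        System.out.println(model.tooManyTrapsPerMethod("test3", "auto_vectorization_check"));
    }
}
```

This is only meant to show why switching a guard from the per-method to the per-bci form can change behavior noticeably: the per-bci form is far more eager to give up.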
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2665123097 From roland at openjdk.org Tue Feb 18 09:56:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 09:56:14 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:48:58 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.cpp line 1097: >> >>> 1095: // PhaseIdealLoop::add_parse_predicate only checks trap limits per method, so >>> 1096: // we do a custom check here. >>> 1097: if (!C->too_many_traps(cloned_sfpt->jvms()->method(), cloned_sfpt->jvms()->bci(), Deoptimization::Reason_auto_vectorization_check)) { >> >> Isn't that done by `add_parse_predicate`? > > @rwestrel I only see `if (!C->too_many_traps(reason)) {` in `PhaseIdealLoop::add_parse_predicate`. And as the comment I put here that only checks the `reason` per `method`, and not per `bci`. Do you see anything else? Seems like it's a bug that `PhaseIdealLoop::add_parse_predicate` doesn't check the `bci` too. Could you fix it? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959411405 From epeter at openjdk.org Tue Feb 18 10:07:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:07:18 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:53:14 GMT, Roland Westrelin wrote: >> @rwestrel I only see `if (!C->too_many_traps(reason)) {` in `PhaseIdealLoop::add_parse_predicate`. And as the comment I put here that only checks the `reason` per `method`, and not per `bci`. Do you see anything else? > > Seems like it's a bug that `PhaseIdealLoop::add_parse_predicate` doesn't check the `bci` too. Could you fix it? @rwestrel So we would check both, right? But is that what we want for all predicates? 
`C->too_many_traps(reason)` checks against `PerMethodTrapLimit`:

    if (trap_count(reason) >= Deoptimization::per_method_trap_limit(reason)) {

But the `bci` check works with `PerBytecodeTrapLimit`, and it actually has a comment like this:

    if (md->has_trap_at(bci, m, reason) != 0) {
      // Assume PerBytecodeTrapLimit==0, for a more conservative heuristic.
      // Also, if there are multiple reasons, or if there is no per-BCI record,
      // assume the worst.

So the `bci` check fails if there has been even a single trap recorded.

So it seems that such a change would affect the behavior in ways I cannot yet predict.

What do you think?

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959431345

From galder at openjdk.org Tue Feb 18 10:09:18 2025
From: galder at openjdk.org (Galder Zamarreño)
Date: Tue, 18 Feb 2025 10:09:18 GMT
Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]
In-Reply-To: 
References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com>
Message-ID: <50uPQ3ue90Xr_LSEm8z3XLTL1yx2A-Q0SJ8rdmv-gsg=.960a6c31-9850-4ce3-bd88-41d4342a5605@github.com>

On Tue, 18 Feb 2025 09:21:46 GMT, Galder Zamarreño wrote:

> For the equivalent long tests I think I made a mistake in the id of the disabled intrinsic, it should be _maxL and not _max. I will repeat the tests and post if any similar differences observed.
FYI Indeed a similar pattern is observed for long min/max (with the patch in this PR):

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.longReductionSimpleMax" MICRO="FORK=1"

Benchmark                            (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxVector.longReductionSimpleMax             50    2048  thrpt    4  460.392 ± 0.076  ops/ms
MinMaxVector.longReductionSimpleMax             80    2048  thrpt    4  460.459 ± 0.438  ops/ms
MinMaxVector.longReductionSimpleMax            100    2048  thrpt    4  460.469 ± 0.057  ops/ms

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.longReductionSimpleMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_maxL"

Benchmark                            (probability)  (size)   Mode  Cnt     Score   Error   Units
MinMaxVector.longReductionSimpleMax             50    2048  thrpt    4   460.453 ± 0.188  ops/ms
MinMaxVector.longReductionSimpleMax             80    2048  thrpt    4   460.507 ± 0.192  ops/ms
MinMaxVector.longReductionSimpleMax            100    2048  thrpt    4  1013.498 ± 1.607  ops/ms

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.longReductionMultiplyMax" MICRO="FORK=1"

Benchmark                              (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxVector.longReductionMultiplyMax             50    2048  thrpt    4  966.429 ± 0.359  ops/ms
MinMaxVector.longReductionMultiplyMax             80    2048  thrpt    4  966.569 ± 0.338  ops/ms
MinMaxVector.longReductionMultiplyMax            100    2048  thrpt    4  966.548 ± 0.575  ops/ms

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.longReductionMultiplyMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_maxL"

Benchmark                              (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxVector.longReductionMultiplyMax             50    2048  thrpt    4  966.130 ± 5.549  ops/ms
MinMaxVector.longReductionMultiplyMax             80    2048  thrpt    4  966.380 ± 0.663  ops/ms
MinMaxVector.longReductionMultiplyMax            100    2048  thrpt    4  859.233 ± 7.817  ops/ms

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2665159015

From epeter at openjdk.org Tue Feb 18 10:11:16 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 18 Feb 2025 10:11:16 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com>
References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com>
Message-ID: 

On Tue, 18 Feb 2025 09:57:29 GMT, Roland Westrelin wrote:

> > That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that?
>
> There is code that removes the OuterStripMinedLoop if the CountedLoop goes away and also, if I recall correctly, logic that verifies no `OuterStripMinedLoop` is left behind without a `CountedLoop`, so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity.

Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts. I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to killing it a little earlier. Maybe a very small one?
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2665161507 From roland at openjdk.org Tue Feb 18 10:11:17 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 10:11:17 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> On Tue, 18 Feb 2025 10:04:59 GMT, Emanuel Peter wrote: >> Seems like it's a bug that `PhaseIdealLoop::add_parse_predicate` doesn't check the `bci` too. Could you fix it? > > @rwestrel So we would check both, right? But is that what we want for all predicates? > > `C->too_many_traps(reason)` checks against `PerMethodTrapLimit`: > > if (trap_count(reason) >= Deoptimization::per_method_trap_limit(reason)) { > > > But the `bci` check works with `PerBytecodeTrapLimit`, and it actually has a comment like this: > > if (md->has_trap_at(bci, m, reason) != 0) { > // Assume PerBytecodeTrapLimit==0, for a more conservative heuristic. > // Also, if there are multiple reasons, or if there is no per-BCI record, > // assume the worst. > > So the `bci` check fails if there has been even a single trapping recorded. > > So it seems that such a change would affect the behavior in ways I cannot yet predict. > > What do you think? That code is supposed to mirror the `GraphKit::add_parse_predicate()`. It doesn't. Would you like me to fix this separately? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959437628 From epeter at openjdk.org Tue Feb 18 10:20:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:20:15 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:09:00 GMT, Roland Westrelin wrote: > That code Which code are you referring to? Ah, probably you are talking about `PhaseIdealLoop::add_parse_predicate`, which is using the method wide check. And `GraphKit::add_parse_predicate` actually queries `GraphKit::too_many_traps`, which knows the current `bci()`, and can query the per-bci count. > Would you like me to fix this separately? Yes, please. I definitely don't want to do it in this PR ;) And I don't have as much experience with traps as you do. We'd have to think a little about what cases this affects, and if performance would go up or down in all those cases. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959451204 From epeter at openjdk.org Tue Feb 18 10:29:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:29:12 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:09:00 GMT, Roland Westrelin wrote: >> @rwestrel So we would check both, right? But is that what we want for all predicates? 
>> >> `C->too_many_traps(reason)` checks against `PerMethodTrapLimit`: >> >> if (trap_count(reason) >= Deoptimization::per_method_trap_limit(reason)) { >> >> >> But the `bci` check works with `PerBytecodeTrapLimit`, and it actually has a comment like this: >> >> if (md->has_trap_at(bci, m, reason) != 0) { >> // Assume PerBytecodeTrapLimit==0, for a more conservative heuristic. >> // Also, if there are multiple reasons, or if there is no per-BCI record, >> // assume the worst. >> >> So the `bci` check fails if there has been even a single trapping recorded. >> >> So it seems that such a change would affect the behavior in ways I cannot yet predict. >> >> What do you think? > > That code is supposed to mirror the `GraphKit::add_parse_predicate()`. It doesn't. Would you like me to fix this separately? @rwestrel do you consider that a blocking issue for this PR here? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959463556 From roland at openjdk.org Tue Feb 18 10:29:13 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 10:29:13 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:25:08 GMT, Emanuel Peter wrote: >> That code is supposed to mirror the `GraphKit::add_parse_predicate()`. It doesn't. Would you like me to fix this separately? > > @rwestrel do you consider that a blocking issue for this PR here? 
No ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959465988 From adinn at openjdk.org Tue Feb 18 13:36:27 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 18 Feb 2025 13:36:27 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: <3kiI1J7jcczgzTRi9HZztzhGe1blcy8Ga11xoGhzueY=.98543172-5b38-4199-bead-0988de0e0e75@github.com> On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments + some code reorganization src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2594: > 2592: guarantee(T != T1Q && T != T1D, "incorrect arrangement"); \ > 2593: if (!acceptT2D) guarantee(T != T2D, "incorrect arrangement"); \ > 2594: if (strcmp(#NAME, "sqdmulh") == 0) guarantee(T != T8B && T != T16B, "incorrect arrangement"); \ Suggestion: I think it might be better to change this test from a strcmp call to (opc2 == 0b101101). The strcmp test is clearer to a reader of the code but the call may not be guaranteed to be compiled out at build time while the latter will. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1959758334 From adinn at openjdk.org Tue Feb 18 13:46:17 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 18 Feb 2025 13:46:17 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision:
> >
> >   Adding comments + some code reorganization

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4066:

> 4064: }
> 4065:
> 4066: // Execute on round of keccak of two computations in parallel.

Suggestion: It would be helpful to add comments that relate the register and instruction selection to the original Java source code, e.g. change the header as follows:

// Performs 2 keccak round transformations using vector parallelism
//
// Two sets of 25 * 64-bit input states a0[lo:hi]...a24[lo:hi] are passed in
// the lower/upper halves of registers v0...v24 and the transformed states
// are returned in the same registers. Intermediate 64-bit pairs
// c0...c5 and d0...d5 are computed in registers v25...v30. v31 is
// loaded with the required pair of 64 bit rounding constants.
// During computation of the output states some intermediate results are
// shuffled around registers v0...v30. Comments on each line indicate
// how the values in registers correspond to variables ai, ci, di in
// the Java source code, likewise how the generated machine instructions
// correspond to Java source operations (n.b. rol means rotate left).
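As a reader aid, the four vector instructions used in this stub map onto simple Java `long` operations. A hedged sketch (helper names are made up for illustration; the real scalar Java code lives in the `sun.security.provider` SHA-3 implementation):

```java
// Illustrative Java equivalents of the aarch64 SHA-3 vector instructions
// used by the keccak stub. Each helper operates on one 64-bit lane.
public class KeccakOps {
    // eor3 Vd, Vn, Vm, Va: three-way XOR (the theta column parities c0..c4)
    static long eor3(long n, long m, long a) { return n ^ m ^ a; }

    // rax1 Vd, Vn, Vm: XOR with a left-rotated operand, d = n ^ rol(m, 1)
    static long rax1(long n, long m) { return n ^ Long.rotateLeft(m, 1); }

    // xar Vd, Vn, Vm, #imm: XOR then rotate right by imm; the stub passes
    // (64 - r) so the net effect is rol(n ^ m, r), matching the comments
    static long xar(long n, long m, int r) { return Long.rotateLeft(n ^ m, r); }

    // bcax Vd, Vn, Vm, Va: bit-clear and XOR, d = n ^ (m & ~a) -- the chi step
    static long bcax(long n, long m, long a) { return n ^ (m & ~a); }

    public static void main(String[] args) {
        // e.g. d0 = c4 ^ rol(c1, 1) from the theta step:
        long c4 = 1L, c1 = Long.MIN_VALUE;       // rol(c1, 1) == 1
        System.out.println(rax1(c4, c1) == 0L);  // prints true
    }
}
```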
Then annotate the generation steps as follows:

__ eor3(v29, __ T16B, v4, v9, v14);        // c4 = a4 ^ a9 ^ a14
__ eor3(v26, __ T16B, v1, v6, v11);        // c1 = a1 ^ a6 ^ a11
__ eor3(v28, __ T16B, v3, v8, v13);        // c3 = a3 ^ a8 ^ a13
__ eor3(v25, __ T16B, v0, v5, v10);        // c0 = a0 ^ a5 ^ a10
__ eor3(v27, __ T16B, v2, v7, v12);        // c2 = a2 ^ a7 ^ a12
__ eor3(v29, __ T16B, v29, v19, v24);      // c4 ^= a19 ^ a24
__ eor3(v26, __ T16B, v26, v16, v21);      // c1 ^= a16 ^ a21
__ eor3(v28, __ T16B, v28, v18, v23);      // c3 ^= a18 ^ a23
__ eor3(v25, __ T16B, v25, v15, v20);      // c0 ^= a15 ^ a20
__ eor3(v27, __ T16B, v27, v17, v22);      // c2 ^= a17 ^ a22
__ rax1(v30, __ T2D, v29, v26);            // d0 = c4 ^ rol(c1, 1)
__ rax1(v26, __ T2D, v26, v28);            // d2 = c1 ^ rol(c3, 1)
__ rax1(v28, __ T2D, v28, v25);            // d4 = c3 ^ rol(c0, 1)
__ rax1(v25, __ T2D, v25, v27);            // d1 = c0 ^ rol(c2, 1)
__ rax1(v27, __ T2D, v27, v29);            // d3 = c2 ^ rol(c4, 1)
__ eor(v0, __ T16B, v0, v30);              // a0 = a0 ^ d0
__ xar(v29, __ T2D, v1, v25, (64 - 1));    // a10' = rol((a1^d1), 1)
__ xar(v1, __ T2D, v6, v25, (64 - 44));    // a1 = rol((a6^d1), 44)
__ xar(v6, __ T2D, v9, v28, (64 - 20));    // a6 = rol((a9^d4), 20)
__ xar(v9, __ T2D, v22, v26, (64 - 61));   // a9 = rol((a22^d2), 61)
__ xar(v22, __ T2D, v14, v28, (64 - 39));  // a22 = rol((a14^d4), 39)
__ xar(v14, __ T2D, v20, v30, (64 - 18));  // a14 = rol((a20^d0), 18)
__ xar(v31, __ T2D, v2, v26, (64 - 62));   // a20' = rol((a2^d2), 62)
__ xar(v2, __ T2D, v12, v26, (64 - 43));   // a2 = rol((a12^d2), 43)
__ xar(v12, __ T2D, v13, v27, (64 - 25));  // a12 = rol((a13^d3), 25)
__ xar(v13, __ T2D, v19, v28, (64 - 8));   // a13 = rol((a19^d4), 8)
__ xar(v19, __ T2D, v23, v27, (64 - 56));  // a19 = rol((a23^d3), 56)
__ xar(v23, __ T2D, v15, v30, (64 - 41));  // a23 = rol((a15^d0), 41)
__ xar(v15, __ T2D, v4, v28, (64 - 27));   // a15 = rol((a4^d4), 27)
__ xar(v28, __ T2D, v24, v28, (64 - 14));  // a4' = rol((a24^d4), 14)
__ xar(v24, __ T2D, v21, v25, (64 - 2));   // a24 = rol((a21^d1), 2)
__ xar(v8, __ T2D, v8, v27, (64 - 55));    // a21' = rol((a8^d3), 55)
__ xar(v4, __ T2D, v16, v25, (64 - 45));   // a8' = rol((a16^d1), 45)
__ xar(v16, __ T2D, v5, v30, (64 - 36));   // a16 = rol((a5^d0), 36)
__ xar(v5, __ T2D, v3, v27, (64 - 28));    // a5 = rol((a3^d3), 28)
__ xar(v27, __ T2D, v18, v27, (64 - 21));  // a3' = rol((a18^d3), 21)
__ xar(v3, __ T2D, v17, v26, (64 - 15));   // a18' = rol((a17^d2), 15)
__ xar(v25, __ T2D, v11, v25, (64 - 10));  // a17' = rol((a11^d1), 10)
__ xar(v26, __ T2D, v7, v26, (64 - 6));    // a11' = rol((a7^d2), 6)
__ xar(v30, __ T2D, v10, v30, (64 - 3));   // a7' = rol((a10^d0), 3)
__ bcax(v20, __ T16B, v31, v22, v8);       // a20 = a20' ^ (~a21 & a22')
__ bcax(v21, __ T16B, v8, v23, v22);       // a21 = a21' ^ (~a22 & a23)
__ bcax(v22, __ T16B, v22, v24, v23);      // a22 = a22 ^ (~a23 & a24)
__ bcax(v23, __ T16B, v23, v31, v24);      // a23 = a23 ^ (~a24 & a20')
__ bcax(v24, __ T16B, v24, v8, v31);       // a24 = a24 ^ (~a20' & a21')
__ ld1r(v31, __ T2D, __ post(rscratch1, 8)); // rc = round_constants[i]
__ bcax(v17, __ T16B, v25, v19, v3);       // a17 = a17' ^ (~a18' & a19)
__ bcax(v18, __ T16B, v3, v15, v19);       // a18 = a18' ^ (~a19 & a15')
__ bcax(v19, __ T16B, v19, v16, v15);      // a19 = a19 ^ (~a15 & a16)
__ bcax(v15, __ T16B, v15, v25, v16);      // a15 = a15 ^ (~a16 & a17')
__ bcax(v16, __ T16B, v16, v3, v25);       // a16 = a16 ^ (~a17' & a18')
__ bcax(v10, __ T16B, v29, v12, v26);      // a10 = a10' ^ (~a11' & a12)
__ bcax(v11, __ T16B, v26, v13, v12);      // a11 = a11' ^ (~a12 & a13)
__ bcax(v12, __ T16B, v12, v14, v13);      // a12 = a12 ^ (~a13 & a14)
__ bcax(v13, __ T16B, v13, v29, v14);      // a13 = a13 ^ (~a14 & a10')
__ bcax(v14, __ T16B, v14, v26, v29);      // a14 = a14 ^ (~a10' & a11')
__ bcax(v7, __ T16B, v30, v9, v4);         // a7 = a7' ^ (~a8' & a9)
__ bcax(v8, __ T16B, v4, v5, v9);          // a8 = a8' ^ (~a9 & a5)
__ bcax(v9, __ T16B, v9, v6, v5);          // a9 = a9 ^ (~a5 & a6)
__ bcax(v5, __ T16B, v5, v30, v6);         // a5 = a5 ^ (~a6 & a7)
__ bcax(v6, __ T16B, v6, v4, v30);         // a6 = a6 ^ (~a7 & a8')
__ bcax(v3, __ T16B, v27, v0, v28);        // a3 = a3' ^ (~a4' & a0)
__ bcax(v4, __ T16B, v28, v1, v0);         // a4 = a4' ^ (~a0 & a1)
__ bcax(v0, __ T16B, v0, v2, v1);          // a0 = a0 ^ (~a1 & a2)
__ bcax(v1, __ T16B, v1, v27, v2);         // a1 = a1 ^ (~a2 & a3)
__ bcax(v2, __ T16B, v2, v28, v27);        // a2 = a2 ^ (~a3 & a4')
__ eor(v0, __ T16B, v0, v31);              // a0 = a0 ^ rc

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1959776475 From kvn at openjdk.org Tue Feb 18 19:24:22 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:24:22 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 03:03:34 GMT, Chris Plummer wrote:

>> Before I forgot to answer you, @plummercj
>> I completely agree with your comment about cleaning up wrapper subclasses which do nothing.
>>
>> I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. Why not use `getName()` for this purpose without big `if/else` there?
>>
>> An other purpose could be a place holder for additional information in a future which never come.
>>
>> Other wrapper provides information available in `CodeBlob`. Like `RuntimeStub. callerMustGCArguments()`. `_caller_must_gc_arguments` field is part of VM's `CodeBlob` class for some time now. Looks like I missed change in SA when did change in VM.
>>
>> So yes, feel free to clean this up. I will help with review.
>
>> I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. Why not use `getName()` for this purpose without big `if/else` there?
>
> Possibly getName() didn't exist when PStack was first written. It would be good if PStack not only included the type name as it does now, but also the actual name of the blob, which getName() would return.
> >> An other purpose could be a place holder for additional information in a future which never come. > > Yes, and you also see that with the Observer registration and the `Type type = db.lookupType()` code, which are only needed if you are going to lookup fields of the subtypes, which most don't ever do, yet they all have this code. > >> Other wrapper provides information available in `CodeBlob`. Like `RuntimeStub. callerMustGCArguments()`. `_caller_must_gc_arguments` field is part of VM's `CodeBlob` class for some time now. Looks like I missed change in SA when did change in VM. > > Yeah, that's not working right for CodeBlob subtypes that are not RuntimeStubs. Easy to fix though. > >> So yes, feel free to clean this up. I will help with review. > > Ok. Let me see where things are at after you are done with the PR. Thank you, @plummercj , for review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2666228333 From kvn at openjdk.org Tue Feb 18 19:24:34 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:24:34 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sat, 15 Feb 2025 06:34:56 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
>> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove commented lines left by mistake Thank you all for reviews and suggestions. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2666253220 From kvn at openjdk.org Tue Feb 18 19:26:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:04 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: On Tue, 18 Feb 2025 10:07:07 GMT, Emanuel Peter wrote: >>> That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that? >> >> There is code that removes the `OuterStripMinedLoop` if the `CountedLoop` goes away and also, if I recall correctly, logic that verifies no ``OuterStripMinedLoop` is left behind without a `CountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity. > >> > That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that? > >>There is code that removes the OuterStripMinedLoop if the CountedLoop goes away and also, if I recall correctly, logic that verifies no ``OuterStripMinedLoopis left behind without aCountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity. > > Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts. I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to kill it already a little earlier. Maybe a very small one? 
@eme64, my main concern is that loop multiversion code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger multiversion code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance. Do we really need it for these changes? Can we simply generate the un-vectorized loop?

"x86 and aarch64 are unaffected". Which platforms are affected? Should we really sacrifice code complexity for platforms we don't support?

Another question is what deoptimization `Action` is taken when the predicate fails. I saw a comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize code? Can we hit the uncommon trap a few times before deoptimization? Deoptimization after one trap assumes we will process the same un-aligned data again. In a test it could be true, but is it true in reality too?

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666176147 From epeter at openjdk.org Tue Feb 18 19:26:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 19:26:07 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: On Tue, 18 Feb 2025 16:10:20 GMT, Vladimir Kozlov wrote:

>>> > That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that?
>>
>>> There is code that removes the `OuterStripMinedLoop` if the `CountedLoop` goes away and also, if I recall correctly, logic that verifies no `OuterStripMinedLoop` is left behind without a `CountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity.
>>
>> Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts.
I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to kill it already a little earlier. Maybe a very small one?

> @eme64, my main concern is that loop multiversion code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger multiversion code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance. Do we really need it for these changes? Can we simply generate the un-vectorized loop?
>
> "x86 and aarch64 are unaffected". Which platforms are affected? Should we really sacrifice code complexity for platforms we don't support?
>
> Another question is what deoptimization `Action` is taken when the predicate fails. I saw a comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize code? Can we hit the uncommon trap a few times before deoptimization? Deoptimization after one trap assumes we will process the same un-aligned data again. In a test it could be true, but is it true in reality too?

@vnkozlov

> "x86 and aarch64 are unaffected". Which platforms are affected? Should we really sacrifice code complexity for platforms we don't support?

I would say most of the code here, i.e. the predicate and multi-version parts, are also relevant for the upcoming patch for aliasing analysis runtime-checks. These are especially important for `MemorySegment` cases where there could basically always be aliasing and only runtime-checks can help us vectorize. There is really only a small part, which is emitting the actual alignment-check.

> Do we really need it for these changes? Can we simply generate the un-vectorized loop?

The alternatives on architectures that are actually affected by this bug:
- Not fix the bug, and risk possible `SIGBUS`.
And on our platforms, that just means living with the HALT caused by `VerifyAlignVector`.
- Disable ALL vectorization of cases where we cannot guarantee statically that accesses are aligned. That would certainly disable all uses of `MemorySegment`, and that is probably not preferable.

> my main concern is that loop multiversion code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger multiversion code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance.

Right. I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means that the loop is already copied 3x, and doing 4x is not that noticeable, I would suspect. Also, with OSR we already currently don't generate predicates, and so it is generating the multi-versioning for those. And I really could not measure any difference in the performance benchmarking. I doubt it is even noticeable in compile time.

> Another question is what deoptimization `Action` is taken when the predicate fails. I saw a comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize code? Can we hit the uncommon trap a few times before deoptimization? Deoptimization after one trap assumes we will process the same un-aligned data again. In a test it could be true, but is it true in reality too?

Yes, when we deopt for the bci, we recompile immediately. The alternative is to make the check per method, but then the risk is that one loop deopting causes other loops to be multi-versioned instead of using predicates too. Counting deopts per bci is currently not done at all. But I suppose we could make it a bit more "forgiving"... but is that worth it?
I suppose if in reality we do see non-aligned cases (or in the future cases where we have problematic aliasing), then it will probably repeat, and is worth recompiling to handle both cases. But that is speculation, and we can discuss :)

TLDR: @vnkozlov I would not have fixed the bug with such a heavy mechanism if I did not intend to use it for runtime checks for aliasing analysis. And 90% of the code here is reusable for that.

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666357998 From kvn at openjdk.org Tue Feb 18 19:26:11 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:11 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: <7CUvxR76ROhB7TB2qqbF2nQB5RNIj4GpRvKqZSw-dDM=.8917fc6a-3e84-4a9b-8df7-2eec07cfa768@github.com> On Tue, 18 Feb 2025 17:20:23 GMT, Emanuel Peter wrote:

> Do we really need it for these changes? Can we simply generate the un-vectorized loop?

To clarify: this question was about the second phase, after we deoptimize and recompile when we hit a predicate check failure. I am fine with the predicate change.

> And I really could not measure any difference in the performance benchmarking. I doubt it is even noticeable in compile time.

Right. If a method has a vectorizable loop, it most likely has big generated code and is not inlined already. So adding a 4th loop may not affect it significantly.
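For readers following along, the unaligned-`native`-base condition this thread keeps referring to boils down to a simple address test. A minimal sketch, illustrative only (the real check is emitted by C2 as IR in the parse predicate or `multiversion_if`, not written in Java; the addresses below are hypothetical):

```java
// Models the runtime alignment check on a native base address as plain
// long arithmetic, mirroring the nativeAligned.asSlice(1) example above.
public class AlignCheck {
    // True if 'address' is aligned to 'elemSize' bytes (elemSize a power of two).
    static boolean aligned(long address, int elemSize) {
        return (address & (elemSize - 1)) == 0;
    }

    public static void main(String[] args) {
        long base = 0x7f0000001000L;  // hypothetical 4K-aligned native allocation
        long sliced = base + 1;       // models asSlice(1): off-by-one base
        System.out.println(aligned(base, 4));    // true: fast (vectorized) loop is safe
        System.out.println(aligned(sliced, 4));  // false: must take the slow loop
    }
}
```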
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666506254 PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666525354 From kvn at openjdk.org Tue Feb 18 19:26:16 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:16 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote:

> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>
> **Background**
>
> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer.
>
> **Problem**
>
> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code.
>
>     MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
>     MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
>     test3(nativeUnaligned);
>
> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not!
>
>     static void test3(MemorySegment ms) {
>         for (int i = 0; i < RANGE; i++) {
>             long adr = i * 4L;
>             int v = ms.get(ELEMENT_LAYOUT, adr);
>             ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
>         }
>     }
>
> **Solution: Runtime Checks - Predicate and Multiversioning**
>
> Of course we could just forbid cases where we have a `native` base from vectorizing.
But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check.
>
> I came up with 2 options where to place the runtime checks:
> - A new "auto vectorization" Parse Predicate:
>   - This only works when predicates are available.
>   - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
> - Multiversion the loop:
>   - Create 2 copies of the loop (fast and slow loops).
>   - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take
>   - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code.
>   - We "stall" the `...

What probabilities are used for the multi-version loop branches? Is the non-vectorized version moved out of the hot path in the generated code?

About the actual probability value: I was thinking PROB_LIKELY_MAG(3). PROB_LIKELY_MAG(1) will only guarantee that the vectorized loop comes first, but it could be enough without moving the other loop out of the hot path. Needs testing.

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666554240 PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666710345 From epeter at openjdk.org Tue Feb 18 19:26:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 19:26:18 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 18:29:42 GMT, Vladimir Kozlov wrote:

> What probabilities are used for the multi-version loop branches? Is the non-vectorized version moved out of the hot path in the generated code?
I'm not sure what you are asking. Are you asking what probability I'm setting for the multi-version branch?

This is the loop selector, which later gets copied for each of the checks: `const LoopSelector loop_selector(lpt, opaque, PROB_FAIR, COUNT_UNKNOWN);`

So 50%. But maybe you are suggesting it should really be biased towards the fast-path, right? What probability would you suggest? It should probably be fairly low, since there can be multiple checks added, and each one lowers the probability of arriving at the true-loop. So for scheduling, we should keep the probability high, so the true-loop is scheduled closer, right?

Is that what you meant?

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666602599 From kvn at openjdk.org Tue Feb 18 19:26:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:19 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 18:45:34 GMT, Emanuel Peter wrote:

> > What probabilities are used for the multi-version loop branches? Is the non-vectorized version moved out of the hot path in the generated code?
>
> I'm not sure what you are asking. Are you asking what probability I'm setting for the multi-version branch?
>
> This is the loop selector, which later gets copied for each of the checks: `const LoopSelector loop_selector(lpt, opaque, PROB_FAIR, COUNT_UNKNOWN);`
>
> So 50%. But maybe you are suggesting it should really be biased towards the fast-path, right? What probability would you suggest? It should probably be fairly low, since there can be multiple checks added, and each one lowers the probability of arriving at the true-loop. So for scheduling, we should keep the probability high, so the true-loop is scheduled closer, right?
>
> Is that what you meant?

Yes. I want to prioritize the fast path, assuming it is the vectorized loop and that we get aligned data more frequently.
It is actually difficult to judge without statistic from real applications. It should be reversed if an application works mostly on unaligned data. Can we profile alignment in Interpreter (and C1)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666635167 From kvn at openjdk.org Tue Feb 18 20:11:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 20:11:04 GMT Subject: Integrated: 8349088: De-virtualize Codeblob and nmethod In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 17:45:30 GMT, Vladimir Kozlov wrote: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp This pull request has now been integrated. 
Changeset: 46d4a601
Author: Vladimir Kozlov
URL: https://git.openjdk.org/jdk/commit/46d4a601e04f90b11d4ccc97a49f4e7010b4fd83
Stats: 529 lines in 23 files changed: 262 ins; 152 del; 115 mod

8349088: De-virtualize Codeblob and nmethod

Co-authored-by: Stefan Karlsson
Co-authored-by: Chris Plummer
Reviewed-by: cjplummer, aboldtch, dlong

------------- PR: https://git.openjdk.org/jdk/pull/23533 From coleenp at openjdk.org Tue Feb 18 23:49:52 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 18 Feb 2025 23:49:52 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native Message-ID:

Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes.
Tested with tier1-4 and performance tests.

------------- Commit messages:
 - Add ')' removed from jvmci test.
 - Shrink modifiers flag so isPrimitive can share word.
 - Remove isPrimitive intrinsic in favor of a boolean.
 - Make isInterface non-native.
 - Make isArray non-native

Changes: https://git.openjdk.org/jdk/pull/23572/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8349860
Stats: 178 lines in 19 files changed: 37 ins; 115 del; 26 mod
Patch: https://git.openjdk.org/jdk/pull/23572.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572

PR: https://git.openjdk.org/jdk/pull/23572 From liach at openjdk.org Tue Feb 18 23:49:54 2025 From: liach at openjdk.org (Chen Liang) Date: Tue, 18 Feb 2025 23:49:54 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote:

> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes.
> Tested with tier1-4 and performance tests.

We often need to determine what primitive type a `class` is. Currently we do it through `Wrapper.forPrimitiveType`. Do you see potential value in encoding the primitive status in a byte, so primitive info also knows what primitive type this class is instead of doing identity comparisons? @cl4es Can you offer some insight here?

src/java.base/share/classes/jdk/internal/reflect/Reflection.java line 59:

> 57: Reflection.class, ALL_MEMBERS,
> 58: AccessibleObject.class, ALL_MEMBERS,
> 59: Class.class, Set.of("classLoader", "classData", "modifiers", "isPrimitive"),

I think the field is named `isPrimitive`, right?
test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.runtime.test/src/jdk/vm/ci/runtime/test/TestResolvedJavaType.java line 933: > 931: if (f.getDeclaringClass().equals(metaAccess.lookupJavaType(Class.class))) { > 932: String name = f.getName(); > 933: return name.equals("classLoader") || name.equals("classData") || name.equals("modifiers") || name.equals("isPrimitive"); Same field name remark. test/jdk/jdk/internal/reflect/Reflection/Filtering.java line 59: > 57: { Class.class, "classData" }, > 58: { Class.class, "modifiers" }, > 59: { Class.class, "isPrimitive" }, Same field name remark. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2654120983 PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2659605250 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1951773863 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1951774073 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1951774214 From coleenp at openjdk.org Tue Feb 18 23:49:54 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 18 Feb 2025 23:49:54 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. I had a look at Wrapper.forPrimitiveType() and it's not an intrinsic so I don't really know how hot it is. It's a comparison, vs getting a field out of Class. Not sure how to measure it. So I can't address it in this change. 
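Field-naming questions aside, the non-native checks described in the RFR can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual java.lang.Class patch; the class and field names here are hypothetical:

```java
// Sketch of the approach: isInterface() reads modifier flags, isArray()
// tests the component mirror for non-null, and isPrimitive() reads a
// JVM-initialized boolean. Names are illustrative, not the JDK's.
public class ClassShape {
    final int modifiers;          // set by the JVM at mirror creation
    final Object componentType;   // non-null only for array classes
    final boolean isPrimitive;    // the new final transient boolean

    ClassShape(int modifiers, Object componentType, boolean isPrimitive) {
        this.modifiers = modifiers;
        this.componentType = componentType;
        this.isPrimitive = isPrimitive;
    }

    boolean isInterface() {
        return (modifiers & java.lang.reflect.Modifier.INTERFACE) != 0;
    }
    boolean isArray()     { return componentType != null; }
    boolean isPrimitive() { return isPrimitive; }  // field, not a VM call
}
```

All three checks become plain field reads or comparisons, which is why no intrinsic or native transition is needed.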
------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2659396480 From redestad at openjdk.org Tue Feb 18 23:49:54 2025 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 18 Feb 2025 23:49:54 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Touching `Wrapper` seems out of scope for this PR, but if `Class.isPrimitive` gets cheaper from this then `Wrapper.forPrimitiveType` should definitely be examined in a follow-up. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2661970849 From coleenp at openjdk.org Tue Feb 18 23:49:55 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 18 Feb 2025 23:49:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> References: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> Message-ID: On Wed, 12 Feb 2025 00:05:13 GMT, Chen Liang wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. 
> > src/java.base/share/classes/jdk/internal/reflect/Reflection.java line 59: > >> 57: Reflection.class, ALL_MEMBERS, >> 58: AccessibleObject.class, ALL_MEMBERS, >> 59: Class.class, Set.of("classLoader", "classData", "modifiers", "isPrimitive"), > > I think the field is named `isPrimitive`, right? The method is isPrimitive so I think I had to give the field isPrimitiveType as a name, so this is wrong. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1952521536 From dlong at openjdk.org Wed Feb 19 02:38:03 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 02:38:03 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> Message-ID: On Wed, 12 Feb 2025 12:05:22 GMT, Coleen Phillimore wrote: >> src/java.base/share/classes/jdk/internal/reflect/Reflection.java line 59: >> >>> 57: Reflection.class, ALL_MEMBERS, >>> 58: AccessibleObject.class, ALL_MEMBERS, >>> 59: Class.class, Set.of("classLoader", "classData", "modifiers", "isPrimitive"), >> >> I think the field is named `isPrimitive`, right? > > The method is isPrimitive so I think I had to give the field isPrimitiveType as a name, so this is wrong. I don't know if we have a style guide that covers this, but I believe the method and field could both be named `isPrimitive`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960863953 From dlong at openjdk.org Wed Feb 19 02:56:57 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 02:56:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. src/hotspot/share/classfile/javaClasses.inline.hpp line 301: > 299: #ifdef ASSERT > 300: // The heapwalker walks through Classes that have had their Klass pointers removed, so can't assert this. > 301: // assert(is_primitive == java_class->bool_field(_is_primitive_offset), "must match what we told Java"); I don't understand this comment about the heapwalker. It sounds like we could have `is_primitive` set to true incorrectly. If so, what prevents the asserts below from failing? And why not use the value from _is_primitive_offset instead? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960876174 From liach at openjdk.org Wed Feb 19 02:56:58 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 19 Feb 2025 02:56:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> Message-ID: On Wed, 19 Feb 2025 02:35:25 GMT, Dean Long wrote: >> The method is isPrimitive so I think I had to give the field isPrimitiveType as a name, so this is wrong. > > I don't know if we have a style guide that covers this, but I believe the method and field could both be named `isPrimitive`. 
I would personally name such a boolean field `primitive`, but I don't have a strong preference on the field naming as long as its references in tests and other locations are correct. In addition, I believe this field may soon be widened to carry more hotspot-specific flags (such as hidden, etc.) so the name is bound to change. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960876569 From haosun at openjdk.org Wed Feb 19 02:58:03 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 19 Feb 2025 02:58:03 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments + some code reorganization Hi. Here is the test result of our CI. ### copyright year the following files should update the copyright year to 2025. src/hotspot/cpu/aarch64/assembler_aarch64.hpp src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp src/hotspot/share/runtime/globals.hpp src/java.base/share/classes/sun/security/provider/ML_DSA.java src/java.base/share/classes/sun/security/provider/SHA3Parallel.java test/micro/org/openjdk/bench/java/security/MLDSA.java ### cross-build failure Cross build for riscv64/s390/ppc64 failed. 
Here is the error message for ppc64 === Output from failing command(s) repeated here === * For target support_interim-jmods_support__create_java.base.jmod_exec: # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769 # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0 # # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3) # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) # Problematic frame: # V [libjvm.so+0x3b391c] Instruction_aarch64::~Instruction_aarch64()+0xbc # # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/ # # An error report file with more information is saved as: # /tmp/jdk-src/make/hs_err_pid72752.log ... (rest of output omitted) * All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs. 
=== End of repeated output === I suppose we should make a similar update to the one in `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` for the other platforms ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2667389849 From dlong at openjdk.org Wed Feb 19 03:32:53 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 03:32:53 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. src/java.base/share/classes/java/lang/Class.java line 1287: > 1285: */ > 1286: public Class getComponentType() { > 1287: // Only return for array types. Storage may be reused for Class for instance types. I don't see any changes to componentType related to reuse. So was this comment and the code below already obsolete? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960897176 From dlong at openjdk.org Wed Feb 19 03:37:52 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 03:37:52 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. src/hotspot/share/prims/jvm.cpp line 2283: > 2281: // Otherwise it returns its argument value which is the _the_class Klass*. 
> 2282: // Please, refer to the description in the jvmtiThreadState.hpp. > 2283: Does this "RedefineClasses support" comment still belong here? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960900041 From dholmes at openjdk.org Wed Feb 19 05:14:58 2025 From: dholmes at openjdk.org (David Holmes) Date: Wed, 19 Feb 2025 05:14:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Just a few passing comments as this is mainly compiler stuff. Does the SA not need any updates in relation to this? src/hotspot/share/classfile/javaClasses.cpp line 1371: > 1369: #endif > 1370: set_modifiers(java_class, JVM_ACC_ABSTRACT | JVM_ACC_FINAL | JVM_ACC_PUBLIC); > 1371: set_is_primitive(java_class); Just wondering what the comments at the start of this method are alluding to now that we do have a field at the Java level. ??? src/hotspot/share/prims/jvm.cpp line 1262: > 1260: JVM_END > 1261: > 1262: JVM_ENTRY(jboolean, JVM_IsArrayClass(JNIEnv *env, jclass cls)) Where are the changes to jvm.h? src/java.base/share/classes/java/lang/Class.java line 1009: > 1007: private transient Object classData; // Set by VM > 1008: private transient Object[] signers; // Read by VM, mutable > 1009: private final transient char modifiers; // Set by the VM Why the change of type here? 
------------- PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2625638624 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960955739 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960959718 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960960668 From epeter at openjdk.org Wed Feb 19 07:19:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 07:19:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 19:18:34 GMT, Vladimir Kozlov wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
>> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > About actual probability value. I was thinking PROB_LIKELY_MAG(3). PROB_LIKELY_MAG(1) will only guarantee that vectorized loop will be first but it could be enough without moving other loop from hot path. Needs testing. @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. 
Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there. Does that sound ok? > Can we profile alignment in Interpreter (and C1)? It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2667703955 From epeter at openjdk.org Wed Feb 19 07:42:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 07:42:52 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: Message-ID: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. 
But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). 
> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 63 commits: - Merge branch 'master' into JDK-8323582-SW-native-alignment - remove multiversion mark if we break the structure - register opaque with igvn - copyright and rm CFG check - IR rules for all cases - 3 test versions - test changed to unaligned ints - stub for slicing - add Verify/AlignVector runs to test - refactor verify - ... and 53 more: https://git.openjdk.org/jdk/compare/9042aa82...a98ffabf ------------- Changes: https://git.openjdk.org/jdk/pull/22016/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=01 Stats: 1074 lines in 27 files changed: 951 ins; 28 del; 95 mod Patch: https://git.openjdk.org/jdk/pull/22016.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016 PR: https://git.openjdk.org/jdk/pull/22016 From adinn at openjdk.org Wed Feb 19 10:44:55 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 19 Feb 2025 10:44:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Tue, 4 Feb 2025 18:57:28 GMT, Ferenc Rakoczi wrote: >>> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. 
>> >> @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. > >> @ferakocz Yes, the stub declaration part of it looks to be correct. >> >> The rest of the patch will need at least two reviewers (@theRealAph? @martinuy? @franferrax) and may take some time to review, given that they will probably need to read up on the maths and algorithms. As an aid for reviewers and maintainers it would be good to insert a comment into the generator file linking the implementations to the relevant maths and algorithm. I found the FIPS-204 spec and the CRYSTALS-Dilithium Algorithm Specifications and Supporting Documentation paper, Shi Bai, Léo Ducas et al, 2021 - are they the best ones to look at? > > The Java implementation of ML-DSA is based on the FIPS-204 standard and the intrinsics' implementations are based on the corresponding Java methods, except that the montMul() calls in them are inlined. The rest of the transformation from Java code to intrinsic code is pretty straightforward, so a reviewer need not necessarily understand the whole mathematics of the ML-DSA algorithms, just that the Java and the corresponding intrinsic code do the same thing. @ferakocz Apologies for the delays in reviewing and the limited feedback up to now. The code clearly does the job well but I think it would be made clearer and easier to maintain by tweaking/extending some of the generator methods and adding more detailed commenting. I am afraid I may take a few days to provide the relevant details because of other commitments. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2668251335 From roland at openjdk.org Wed Feb 19 12:14:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 19 Feb 2025 12:14:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> On Tue, 18 Feb 2025 17:20:23 GMT, Emanuel Peter wrote: > Right. I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means that the loop is already copied 3x, and doing 4x is not that noticeable I would suspect. Wouldn't usual optimizations be applied to the slow loop as well (pre/main/post, unrolling)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668476997 From epeter at openjdk.org Wed Feb 19 13:08:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 13:08:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 12:12:27 GMT, Roland Westrelin wrote: > > Right. I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means that the loop is already copied 3x, and doing 4x is not that noticeable I would suspect. 
> > Wouldn't usual optimizations be applied to the slow loop as well (pre/main/post, unrolling)? That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. Does that make sense? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668601537 From roland at openjdk.org Wed Feb 19 13:20:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 19 Feb 2025 13:20:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:06:02 GMT, Emanuel Peter wrote: > That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. So if the slow loop is kept, it's fully optimized (other than what misaligned accesses prevent)? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668625485 From epeter at openjdk.org Wed Feb 19 13:20:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 13:20:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:15:46 GMT, Roland Westrelin wrote: > > > That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. > > > > > > So if the slow loop is kept, it's fully optimized (other than what misaligned accesses prevent)? > > Exactly. 
In a sense that would give you similar results as with unswitching, where we also possibly optimize both branches / loops. So the overhead in the final code is 2x: we can expect the fast and slow paths to be about the same size so the section of code for the loop would see its size grow by 2x. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668653066 From coleenp at openjdk.org Wed Feb 19 13:54:55 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 13:54:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> Message-ID: On Wed, 19 Feb 2025 02:54:36 GMT, Chen Liang wrote: >> I don't know if we have a style guide that covers this, but I believe the method and field could both be named `isPrimitive`. > > I would personally name such a boolean field `primitive`, but I don't have a strong preference on the field naming as long as its references in tests and other locations are correct. In addition, I believe this field may soon be widened to carry more hotspot-specific flags (such as hidden, etc.) so the name is bound to change. I like 'primitive'. 'hidden' is also a possibility to add to this and give it the same treatment. I didn't do that one here to limit the changes and I haven't seen all the calls to isHidden so would need to find out how to measure the effects of that change. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961722833 From rriggs at openjdk.org Wed Feb 19 15:12:59 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Wed, 19 Feb 2025 15:12:59 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Is the change to isInterface and isPrimitive performance neutral? As @IntrinsicCandidates, there would be some performance gain. src/hotspot/share/prims/jvm.cpp line 2284: > 2282: // Please, refer to the description in the jvmtiThreadState.hpp. > 2283: > 2284: JVM_ENTRY(jboolean, JVM_IsInterface(JNIEnv *env, jclass cls)) JVM_IsInterface is deleted in Class.c, what purpose is this? ------------- PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2627122068 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961858757 From rriggs at openjdk.org Wed Feb 19 15:15:57 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Wed, 19 Feb 2025 15:15:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. 
src/java.base/share/classes/java/lang/Class.java line 807: > 805: */ > 806: public boolean isArray() { > 807: return componentType != null; The componentType declaration should have a comment indicating that == null is the sole indication that the class is not an array. Perhaps there should be an assert somewhere validating/cross checking that requirement. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961869286 From epeter at openjdk.org Wed Feb 19 15:25:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 15:25:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:26:37 GMT, Roland Westrelin wrote: > So the overhead in the final code is 2x: we can expect the fast and slow paths to be about the same size so the section of code for the loop would see its size grow by 2x. Yes, if you get to the point where you add a multi-version-if condition, i.e. where SuperWord has decided it needs a speculative assumption (here for alignment, later for aliasing), then we get the whole loop 2x. I suppose we could try to make the pre-main-post loop more complicated and just multi-version the main-loop, but that sounds much more complicated. Do you see any better way than having the 2x code size if we need both a slow and fast loop? 
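[Editor's note] The fast/slow multiversioning being discussed can be written out at the source level as a sketch (illustrative only — the real transform happens on C2's loop IR, and `isAligned`/`OBJECT_ALIGNMENT` here are stand-ins for the speculative check added to the multiversion_if):

```java
// Conceptual shape of a multiversioned loop: the multiversion_if performs a
// speculative runtime check; only the fast version may assume alignment and
// vectorize, while the slow version stays scalar but still compiles.
class MultiversionSketch {
    static final int OBJECT_ALIGNMENT = 8; // stand-in for ObjectAlignmentInBytes

    static boolean isAligned(long baseAddress) {
        return (baseAddress % OBJECT_ALIGNMENT) == 0;
    }

    static long sum(long baseAddress, int[] data) {
        long total = 0;
        if (isAligned(baseAddress)) {
            // fast_loop: alignment assumption holds -> eligible for vectorization
            for (int i = 0; i < data.length; i++) total += data[i];
        } else {
            // slow_loop: no assumptions, scalar code, but still reasonably fast
            for (int i = 0; i < data.length; i++) total += data[i];
        }
        return total;
    }
}
```

Both versions must compute the same result; the check only selects which assumptions the compiled body may rely on, which is why keeping both roughly doubles the code for this loop.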
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668974247 From liach at openjdk.org Wed Feb 19 15:45:56 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 19 Feb 2025 15:45:56 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: <2sugnK5bK-SWGVluAWw-UNTKKkErTTNYTxCk7t0mOGo=.3734936f-7a10-48ec-8901-01ece733791f@github.com> On Wed, 19 Feb 2025 05:08:36 GMT, David Holmes wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > src/java.base/share/classes/java/lang/Class.java line 1009: > >> 1007: private transient Object classData; // Set by VM >> 1008: private transient Object[] signers; // Read by VM, mutable >> 1009: private final transient char modifiers; // Set by the VM > > Why the change of type here? This is to improve the layout so the introduction of a boolean field does not increase the size of a Class object. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961925828 From kvn at openjdk.org Wed Feb 19 16:08:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 16:08:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 07:17:30 GMT, Emanuel Peter wrote: > The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there. > > Does that sound ok? Yes, it is good plan. 
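[Editor's note] The balanced-probability idea from earlier in the thread (give each of n chained checks probability pow(0.5, 1/n) so their product is the target 0.5) is easy to sanity-check numerically; this snippet is illustrative arithmetic, not compiler code, and `perCheck` is a hypothetical name:

```java
// Per-check branch probability such that n chained checks multiply back to
// the chosen target, e.g. n checks of pow(0.5, 1/n) give a product of 0.5.
class ProbabilitySketch {
    static double perCheck(double target, int n) {
        return Math.pow(target, 1.0 / n);
    }
}
```

For n = 1 this is just the target itself, for n = 2 it is sqrt(0.5) ≈ 0.7071, and in general pow(perCheck(t, n), n) recovers t up to floating-point error, so the combined chain keeps a constant overall probability regardless of how many checks it contains.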
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2669094347 From kvn at openjdk.org Wed Feb 19 16:18:57 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 16:18:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 07:17:30 GMT, Emanuel Peter wrote: > > Can we profile alignment in Interpreter (and C1)? > > It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. > > What do you think? You should not worry about `-Xcomp` - it is a testing flag and we can use some default there. I am fine if you think profiling will not bring us much benefit. Note, I am not asking to create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such a case we may skip the predicate and generate a multiversioned loop during compilation. On the other hand, we may have unaligned access only during startup and not later when we compile the method. Anyway, it does not affect these changes. I will look at the changes more later. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2669115673 From epeter at openjdk.org Wed Feb 19 16:18:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 16:18:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: > I am fine if you think profiling will not bring us much benefits Yeah, I think it is a good assumption that we will always get aligned and non-aliasing inputs. And if that is not the case, then this is a rare case, and it should be ok to pay the price of recompilation, I think. > I will look on changes more later. Thank you :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2669122452 From liach at openjdk.org Wed Feb 19 16:21:57 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 19 Feb 2025 16:21:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: <-rVJ4riSt_UybCT4tvNKCBxGfrHr-xnGx0DNDZyGgsA=.11b43081-86f2-47db-b52c-5f74b8e27960@github.com> On Wed, 19 Feb 2025 03:30:04 GMT, Dean Long wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > src/java.base/share/classes/java/lang/Class.java line 1287: > >> 1285: */ >> 1286: public Class getComponentType() { >> 1287: // Only return for array types. Storage may be reused for Class for instance types. > > I don't see any changes to componentType related to reuse. So was this comment and the code below already obsolete? It was.
Previously, the componentType field was reused for the class initialization monitor int array, which caused problems with core reflection if a program reflectively accessed this field more than a few hundred times. See [JDK-8337622](https://bugs.openjdk.org/browse/JDK-8337622). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961989175 From liach at openjdk.org Wed Feb 19 16:25:55 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 19 Feb 2025 16:25:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Re Roger's IntrinsicCandidate remark: One behavior that might be affected would be C2's inlining preferences. Some inline-sensitive workloads like the FFM API might be affected if some Class attribute access cannot be inlined because the incoming Class object is not constant. See #23460 and #23628. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2669138528 From coleenp at openjdk.org Wed Feb 19 17:16:02 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:02 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Thanks for looking at this change.
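The three checks under discussion can be sketched in plain Java - hypothetical class and field names that mirror the review comments, not the actual java.lang.Class source; ACC_INTERFACE is 0x0200 per the JVM specification:

```java
public class MirrorChecks {
    static final int ACC_INTERFACE = 0x0200; // JVMS class access flag

    // Hypothetical stand-in for java.lang.Class; the VM would initialize
    // these fields when it creates a mirror.
    static final class Mirror {
        final char modifiers;       // u2-sized on purpose, to pack tightly
        final boolean primitive;    // true only for the primitive mirrors
        final Mirror componentType; // non-null only for array classes

        Mirror(char modifiers, boolean primitive, Mirror componentType) {
            this.modifiers = modifiers;
            this.primitive = primitive;
            this.componentType = componentType;
        }

        boolean isInterface() { return (modifiers & ACC_INTERFACE) != 0; }
        boolean isArray()     { return componentType != null; }
        boolean isPrimitive() { return primitive; }
    }

    public static void main(String[] args) {
        // 0x0411 = ACC_ABSTRACT | ACC_FINAL | ACC_PUBLIC, the modifiers the
        // VM sets on primitive mirrors per the set_modifiers call quoted in
        // this thread.
        Mirror intMirror = new Mirror((char) 0x0411, true, null);
        Mirror intArray = new Mirror((char) 0x0411, false, intMirror);
        System.out.println(intMirror.isPrimitive()); // true
        System.out.println(intArray.isArray());      // true
    }
}
```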
------------- PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2626906239 From coleenp at openjdk.org Wed Feb 19 17:16:04 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:04 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 05:01:53 GMT, David Holmes wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > src/hotspot/share/classfile/javaClasses.cpp line 1371: > >> 1369: #endif >> 1370: set_modifiers(java_class, JVM_ACC_ABSTRACT | JVM_ACC_FINAL | JVM_ACC_PUBLIC); >> 1371: set_is_primitive(java_class); > > Just wondering what the comments at the start of this method are alluding to now that we do have a field at the Java level. I think this comment is talking about the java.lang.Class.klass field being null. Which it still is, since there's no Klass pointer for basic types. But I have no idea what the comment in ClassFileParser is about, and I don't think introducing a new Klass for primitive types is an improvement. There are comments elsewhere that the klass is null for primitive types, including at the call to java_lang_Class::is_primitive(), so this whole comment is only confusing and I'll remove it. Or change it to: // Mirrors for basic types have a null klass field, which makes them special. > src/hotspot/share/prims/jvm.cpp line 1262: > >> 1260: JVM_END >> 1261: >> 1262: JVM_ENTRY(jboolean, JVM_IsArrayClass(JNIEnv *env, jclass cls)) > > Where are the changes to jvm.h? Good catch, I also removed getProtectionDomain.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961739084 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961773882 From coleenp at openjdk.org Wed Feb 19 17:16:05 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:05 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> On Wed, 19 Feb 2025 02:54:05 GMT, Dean Long wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > src/hotspot/share/classfile/javaClasses.inline.hpp line 301: > >> 299: #ifdef ASSERT >> 300: // The heapwalker walks through Classes that have had their Klass pointers removed, so can't assert this. >> 301: // assert(is_primitive == java_class->bool_field(_is_primitive_offset), "must match what we told Java"); > > I don't understand this comment about the heapwalker. It sounds like we could have `is_primitive` set to true incorrectly. If so, what prevents the asserts below from failing? And why not use the value from _is_primitive_offset instead? This is a good question. The heapwalker walks through dead mirrors so I can't assert that a null klass field matches our boolean setting but I don't know why this never asserts (can't find any instances in the bug database) but it seems like it could. I'll use the bool field in the mirror in the assert though but not in the return since the caller likely will fetch the klass pointer next. > src/hotspot/share/prims/jvm.cpp line 2283: > >> 2281: // Otherwise it returns its argument value which is the _the_class Klass*. 
>> 2282: // Please, refer to the description in the jvmtiThreadState.hpp. >> 2283: > > Does this "RedefineClasses support" comment still belong here? I think so. The comment in jvmtiThreadState.hpp has details why this is. We do a mirror switch before verification apparently because of bug 6214132 it says. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961770573 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962059680 From coleenp at openjdk.org Wed Feb 19 17:16:06 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:06 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: <2sugnK5bK-SWGVluAWw-UNTKKkErTTNYTxCk7t0mOGo=.3734936f-7a10-48ec-8901-01ece733791f@github.com> References: <2sugnK5bK-SWGVluAWw-UNTKKkErTTNYTxCk7t0mOGo=.3734936f-7a10-48ec-8901-01ece733791f@github.com> Message-ID: On Wed, 19 Feb 2025 15:42:54 GMT, Chen Liang wrote: >> src/java.base/share/classes/java/lang/Class.java line 1009: >> >>> 1007: private transient Object classData; // Set by VM >>> 1008: private transient Object[] signers; // Read by VM, mutable >>> 1009: private final transient char modifiers; // Set by the VM >> >> Why the change of type here? > > This is to improve the layout so the introduction of a boolean field does not increase the size of a Class object. I changed modifiers to u2 so that we won't have an alignment gap with the bool isPrimitiveType flag. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962060783 From coleenp at openjdk.org Wed Feb 19 17:16:07 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:07 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: <-rVJ4riSt_UybCT4tvNKCBxGfrHr-xnGx0DNDZyGgsA=.11b43081-86f2-47db-b52c-5f74b8e27960@github.com> References: <-rVJ4riSt_UybCT4tvNKCBxGfrHr-xnGx0DNDZyGgsA=.11b43081-86f2-47db-b52c-5f74b8e27960@github.com> Message-ID: On Wed, 19 Feb 2025 16:19:22 GMT, Chen Liang wrote: >> src/java.base/share/classes/java/lang/Class.java line 1287: >> >>> 1285: */ >>> 1286: public Class getComponentType() { >>> 1287: // Only return for array types. Storage may be reused for Class for instance types. >> >> I don't see any changes to componentType related to reuse. So was this comment and the code below already obsolete? > > It was. Before the componentType field was reused for the class initialization monitor int array, and it caused problems with core reflection if a program reflectively accesses this field after a few hundred times. See [JDK-8337622](https://bugs.openjdk.org/browse/JDK-8337622). Yes, this comment is obsolete. We used to share the componentType mirror with an internal 'init-lock' but it caused a bug that was fixed. If it's not an array the componentType is now always null. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962069719 From galder at openjdk.org Wed Feb 19 17:42:08 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 19 Feb 2025 17:42:08 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after.
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... 
and 34 more: https://git.openjdk.org/jdk/compare/75abfbc2...a190ae68

Following our discussion, I've run `MinMaxVector.long` benchmarks with superword disabled and with/without the `_maxL` intrinsic in both AVX-512 and AVX2 modes. The first thing I've observed is that lacking superword, the results with AVX-512 or AVX2 are identical, so I will just focus on AVX-512 results below.

Benchmark                              (probability)  (range)  (seed)  (size)  Mode  Cnt     -maxL      +maxL  Units
MinMaxVector.longClippingRange                   N/A       90       0    1000  thrpt   4  1012.017  1011.8109  ops/ms
MinMaxVector.longClippingRange                   N/A      100       0    1000  thrpt   4  1012.113  1011.9530  ops/ms
MinMaxVector.longLoopMax                          50      N/A     N/A    2048  thrpt   4   463.946   473.9408  ops/ms
MinMaxVector.longLoopMax                          80      N/A     N/A    2048  thrpt   4   465.391   473.8063  ops/ms
MinMaxVector.longLoopMax                         100      N/A     N/A    2048  thrpt   4   510.992   471.6280  ops/ms (-8%)
MinMaxVector.longLoopMin                          50      N/A     N/A    2048  thrpt   4   496.036   495.3142  ops/ms
MinMaxVector.longLoopMin                          80      N/A     N/A    2048  thrpt   4   495.797   497.1214  ops/ms
MinMaxVector.longLoopMin                         100      N/A     N/A    2048  thrpt   4   495.302   495.1535  ops/ms
MinMaxVector.longReductionMultiplyMax             50      N/A     N/A    2048  thrpt   4   405.495   405.3936  ops/ms
MinMaxVector.longReductionMultiplyMax             80      N/A     N/A    2048  thrpt   4   405.342   405.4505  ops/ms
MinMaxVector.longReductionMultiplyMax            100      N/A     N/A    2048  thrpt   4   846.492   405.4779  ops/ms (-52%)
MinMaxVector.longReductionMultiplyMin             50      N/A     N/A    2048  thrpt   4   414.755   414.7036  ops/ms
MinMaxVector.longReductionMultiplyMin             80      N/A     N/A    2048  thrpt   4   414.705   414.7093  ops/ms
MinMaxVector.longReductionMultiplyMin            100      N/A     N/A    2048  thrpt   4   414.761   414.7150  ops/ms
MinMaxVector.longReductionSimpleMax               50      N/A     N/A    2048  thrpt   4   460.435   460.3764  ops/ms
MinMaxVector.longReductionSimpleMax               80      N/A     N/A    2048  thrpt   4   460.438   460.4718  ops/ms
MinMaxVector.longReductionSimpleMax              100      N/A     N/A    2048  thrpt   4  1023.005   460.5417  ops/ms (-55%)
MinMaxVector.longReductionSimpleMin               50      N/A     N/A    2048  thrpt   4   459.184   459.1662  ops/ms
MinMaxVector.longReductionSimpleMin               80      N/A     N/A    2048  thrpt   4   459.265   459.2588  ops/ms
MinMaxVector.longReductionSimpleMin              100      N/A     N/A    2048  thrpt   4   459.263   459.1304  ops/ms

`longLoopMax at 100%`, `longReductionMultiplyMax at 100%` and `longReductionSimpleMax at 100%` are regressions with the `_maxL` intrinsic. The cause is familiar: without the intrinsic, cmp+mov are emitted, while with the intrinsic and the conditions above, `cmov` is emitted:

# `longLoopMax` @ 100%

-maxL:

  4.18%  0x00007fb7580f84b2: cmpq %r13, %r11
         0x00007fb7580f84b5: jl 0x7fb7580f84ec     ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - java.lang.Math::max at 11 (line 2038)
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 27 (line 256)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
  4.23%  0x00007fb7580f84bb: movq %r11, 0x10(%rbp, %rsi, 8)
                                                   ;*lastore {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 256)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)

+maxL:

  1.06%  0x00007fe1b40f5ed1: movq 0x20(%rbx, %r10, 8), %r14
                                                   ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 26 (line 256)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
  1.34%  0x00007fe1b40f5ed6: cmpq %r14, %r9
  2.78%  0x00007fe1b40f5ed9: cmovlq %r14, %r9
  2.58%  0x00007fe1b40f5edd: movq %r9, 0x20(%rax, %r10, 8)
                                                   ;*lastore {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 256)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)

# `longReductionMultiplyMax` @ 100%

-maxL:

  6.71%  0x00007f8af40f6278: imulq $0xb, 0x18(%r14, %r8, 8), %rdx
                                                   ;*lmul {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 24 (line 285)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)
  5.28%  0x00007f8af40f627e: nop
 10.23%  0x00007f8af40f6280: cmpq %rdx, %rdi
         0x00007f8af40f6283: jge 0x7f8af40f62a7    ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - java.lang.Math::max at 11 (line 2038)
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 30 (line 286)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)

+maxL:

 11.07%  0x00007f47000f5c4d: imulq $0xb, 0x18(%r14, %r11, 8), %rax
                                                   ;*lmul {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 24 (line 285)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)
  0.07%  0x00007f47000f5c53: cmpq %rdx, %rax
 11.87%  0x00007f47000f5c56: cmovlq %rdx, %rax     ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 30 (line 286)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)

# `longReductionSimpleMax` @ 100%

-maxL:

  5.71%  0x00007fc2380f75f9: movq 0x20(%r14, %r8, 8), %rdi
                                                   ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 20 (line 295)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)
  1.85%  0x00007fc2380f75fe: nop
  4.52%  0x00007fc2380f7600: cmpq %rdi, %rdx
         0x00007fc2380f7603: jge 0x7fc2380f7667    ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - java.lang.Math::max at 11 (line 2038)
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 26 (line 296)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)

+maxL:

  3.06%  0x00007fa6d00f6020: movq 0x70(%r14, %r11, 8), %r8
                                                   ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 20 (line 295)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)
         0x00007fa6d00f6025: cmpq %r8, %r13
  2.88%  0x00007fa6d00f6028: cmovlq %r8, %r13      ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 26 (line 296)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2669329851 From galder at openjdk.org Wed Feb 19 17:47:06 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 19 Feb 2025 17:47:06 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after.
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... 
and 34 more: https://git.openjdk.org/jdk/compare/557d790a...a190ae68 I will run a comparison next with the same batch of tests but looking at `int` and see if there are any differences compared with `long` or not. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2669342758 From coleenp at openjdk.org Wed Feb 19 18:40:36 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 18:40:36 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v2] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Code review comments. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/2d9b9ff5..3e731b9f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=00-01 Stats: 17 lines in 3 files changed: 3 ins; 10 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Wed Feb 19 18:40:37 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 18:40:37 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v2] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 15:07:57 GMT, Roger Riggs wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Code review comments. 
> > src/hotspot/share/prims/jvm.cpp line 2284: > >> 2282: // Please, refer to the description in the jvmtiThreadState.hpp. >> 2283: >> 2284: JVM_ENTRY(jboolean, JVM_IsInterface(JNIEnv *env, jclass cls)) > > JVM_IsInterface is deleted in Class.c, what purpose is this? The old classfile verifier uses JVM_IsInterface. > src/java.base/share/classes/java/lang/Class.java line 807: > >> 805: */ >> 806: public boolean isArray() { >> 807: return componentType != null; > > The componentType declaration should have a comment indicating that a non-null value is the sole indication that the class is an array. > Perhaps there should be an assert somewhere validating/cross checking that requirement. I added an assert for set_component_mirror() in the vm, but I don't see how to assert it in Java. Is the comment like: // A non-null componentType is the sole indication that the class is an array; see isArray() ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962078501 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962186820 From coleenp at openjdk.org Wed Feb 19 17:16:05 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:05 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v2] In-Reply-To: References: <-rVJ4riSt_UybCT4tvNKCBxGfrHr-xnGx0DNDZyGgsA=.11b43081-86f2-47db-b52c-5f74b8e27960@github.com> Message-ID: <3orjlwIP5PIjb_UBpCUiIV7ZM1U_5BJfZws3PCleKhw=.55438aa0-1c98-476f-b1db-56672a1bbe4a@github.com> On Wed, 19 Feb 2025 17:10:09 GMT, Coleen Phillimore wrote: >> It was. Before the componentType field was reused for the class initialization monitor int array, and it caused problems with core reflection if a program reflectively accesses this field after a few hundred times. See [JDK-8337622](https://bugs.openjdk.org/browse/JDK-8337622). > > Yes, this comment is obsolete.
We used to share the componentType mirror with an internal 'init-lock' but it caused a bug that was fixed. If it's not an array the componentType is now always null. So for JDK 8 and 21+, the init_lock and componentType are not shared. In JDK 11 and 17, Hotspot shares the fields, but it's not observable with the older implementation of reflection. See https://bugs.openjdk.org/browse/JDK-8337622. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962189932 From coleenp at openjdk.org Wed Feb 19 18:42:56 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 18:42:56 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. I ran our standard set of benchmarks on this change with no differences in performance. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2669470645 From eastigeevich at openjdk.org Wed Feb 19 19:54:05 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 19 Feb 2025 19:54:05 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Wed, 19 Feb 2025 17:43:54 GMT, Galder Zamarreño wrote: >> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 44 additional commits since the last revision:
>>
>> - Merge branch 'master' into topic.intrinsify-max-min-long
>> - Fix typo
>> - Renaming methods and variables and add docu on algorithms
>> - Fix copyright years
>> - Make sure it runs with cpus with either avx512 or asimd
>> - Test can only run with 256 bit registers or bigger
>>
>>   * Remove platform dependant check and use platform independent configuration instead.
>>
>> - Fix license header
>> - Tests should also run on aarch64 asimd=true envs
>> - Added comment around the assertions
>> - Adjust min/max identity IR test expectations after changes
>> - ... and 34 more: https://git.openjdk.org/jdk/compare/384bab03...a190ae68
>
> I will run a comparison next with the same batch of tests but looking at `int` and see if there are any differences compared with `long` or not.

Hi @galderz,
Results from Graviton 3 (Neoverse-V1).
Without the patch:

Benchmark                        (probability)  (range)  (seed)  (size)   Mode  Cnt      Score     Error   Units
MinMaxVector.intClippingRange              N/A       90       0    1000  thrpt    8  12565.427 ±  37.538  ops/ms
MinMaxVector.intClippingRange              N/A      100       0    1000  thrpt    8  12462.072 ±  84.067  ops/ms
MinMaxVector.intLoopMax                     50      N/A     N/A    2048  thrpt    8   5113.090 ±  68.720  ops/ms
MinMaxVector.intLoopMax                     80      N/A     N/A    2048  thrpt    8   5129.857 ±  35.005  ops/ms
MinMaxVector.intLoopMax                    100      N/A     N/A    2048  thrpt    8   5116.081 ±   8.946  ops/ms
MinMaxVector.intLoopMin                     50      N/A     N/A    2048  thrpt    8   6174.544 ±  52.573  ops/ms
MinMaxVector.intLoopMin                     80      N/A     N/A    2048  thrpt    8   6110.884 ±  54.447  ops/ms
MinMaxVector.intLoopMin                    100      N/A     N/A    2048  thrpt    8   6178.661 ±  48.450  ops/ms
MinMaxVector.intReductionMax                50      N/A     N/A    2048  thrpt    8   5109.270 ±  10.525  ops/ms
MinMaxVector.intReductionMax                80      N/A     N/A    2048  thrpt    8   5123.426 ±  28.229  ops/ms
MinMaxVector.intReductionMax               100      N/A     N/A    2048  thrpt    8   5133.799 ±   7.693  ops/ms
MinMaxVector.intReductionMin                50      N/A     N/A    2048  thrpt    8   5130.209 ±  15.491  ops/ms
MinMaxVector.intReductionMin                80      N/A     N/A    2048  thrpt    8   5127.823 ±  27.767  ops/ms
MinMaxVector.intReductionMin               100      N/A     N/A    2048  thrpt    8   5118.217 ±  22.186  ops/ms
MinMaxVector.longClippingRange             N/A       90       0    1000  thrpt    8   1831.026 ±  15.502  ops/ms
MinMaxVector.longClippingRange             N/A      100       0    1000  thrpt    8   1827.194 ±  22.076  ops/ms
MinMaxVector.longLoopMax                    50      N/A     N/A    2048  thrpt    8   2643.383 ±   9.830  ops/ms
MinMaxVector.longLoopMax                    80      N/A     N/A    2048  thrpt    8   2640.417 ±   7.797  ops/ms
MinMaxVector.longLoopMax                   100      N/A     N/A    2048  thrpt    8   1244.321 ±   1.001  ops/ms
MinMaxVector.longLoopMin                    50      N/A     N/A    2048  thrpt    8   3239.234 ±   8.813  ops/ms
MinMaxVector.longLoopMin                    80      N/A     N/A    2048  thrpt    8   3252.713 ±   3.446  ops/ms
MinMaxVector.longLoopMin                   100      N/A     N/A    2048  thrpt    8   1204.370 ±  10.537  ops/ms
MinMaxVector.longReductionMax               50      N/A     N/A    2048  thrpt    8   2536.322 ±   0.127  ops/ms
MinMaxVector.longReductionMax               80      N/A     N/A    2048  thrpt    8   2536.318 ±   0.277  ops/ms
MinMaxVector.longReductionMax              100      N/A     N/A    2048  thrpt    8   1395.273 ±  13.862  ops/ms
MinMaxVector.longReductionMin               50      N/A     N/A    2048  thrpt    8   2536.325 ±   0.146  ops/ms
MinMaxVector.longReductionMin               80      N/A     N/A    2048  thrpt    8   2536.265 ±   0.272  ops/ms
MinMaxVector.longReductionMin              100      N/A     N/A    2048  thrpt    8   1389.982 ±   5.345  ops/ms

With the patch:

Benchmark                        (probability)  (range)  (seed)  (size)   Mode  Cnt      Score     Error   Units
MinMaxVector.intClippingRange              N/A       90       0    1000  thrpt    8  12598.201 ±  52.631  ops/ms
MinMaxVector.intClippingRange              N/A      100       0    1000  thrpt    8  12555.284 ±  62.472  ops/ms
MinMaxVector.intLoopMax                     50      N/A     N/A    2048  thrpt    8   5079.499 ±  16.392  ops/ms
MinMaxVector.intLoopMax                     80      N/A     N/A    2048  thrpt    8   5100.673 ±  30.376  ops/ms
MinMaxVector.intLoopMax                    100      N/A     N/A    2048  thrpt    8   5082.544 ±  23.540  ops/ms
MinMaxVector.intLoopMin                     50      N/A     N/A    2048  thrpt    8   6137.512 ±  30.198  ops/ms
MinMaxVector.intLoopMin                     80      N/A     N/A    2048  thrpt    8   6136.233 ±   7.726  ops/ms
MinMaxVector.intLoopMin                    100      N/A     N/A    2048  thrpt    8   6142.262 ±  96.510  ops/ms
MinMaxVector.intReductionMax                50      N/A     N/A    2048  thrpt    8   5116.055 ±  23.270  ops/ms
MinMaxVector.intReductionMax                80      N/A     N/A    2048  thrpt    8   5111.481 ±  12.236  ops/ms
MinMaxVector.intReductionMax               100      N/A     N/A    2048  thrpt    8   5106.367 ±   9.035  ops/ms
MinMaxVector.intReductionMin                50      N/A     N/A    2048  thrpt    8   5115.666 ±  15.539  ops/ms
MinMaxVector.intReductionMin                80      N/A     N/A    2048  thrpt    8   5133.127 ±   4.918  ops/ms
MinMaxVector.intReductionMin               100      N/A     N/A    2048  thrpt    8   5120.469 ±  24.355  ops/ms
MinMaxVector.longClippingRange             N/A       90       0    1000  thrpt    8   5094.259 ±  14.092  ops/ms
MinMaxVector.longClippingRange             N/A      100       0    1000  thrpt    8   5096.835 ±  16.517  ops/ms
MinMaxVector.longLoopMax                    50      N/A     N/A    2048  thrpt    8   2636.438 ±  18.760  ops/ms
MinMaxVector.longLoopMax                    80      N/A     N/A    2048  thrpt    8   2644.069 ±   3.933  ops/ms
MinMaxVector.longLoopMax                   100      N/A     N/A    2048  thrpt    8   2646.250 ±   2.007  ops/ms
MinMaxVector.longLoopMin                    50      N/A     N/A    2048  thrpt    8   2648.504 ±  18.294  ops/ms
MinMaxVector.longLoopMin                    80      N/A     N/A    2048  thrpt    8   2658.082 ±   3.362  ops/ms
MinMaxVector.longLoopMin                   100      N/A     N/A    2048  thrpt    8   2647.532 ±   5.600  ops/ms
MinMaxVector.longReductionMax               50      N/A     N/A    2048  thrpt    8   2536.254 ±   0.086  ops/ms
MinMaxVector.longReductionMax               80      N/A     N/A    2048  thrpt    8   2536.209 ±   0.129  ops/ms
MinMaxVector.longReductionMax              100      N/A     N/A    2048  thrpt    8   2536.342 ±   0.068  ops/ms
MinMaxVector.longReductionMin               50      N/A     N/A    2048  thrpt    8   2536.271 ±   0.203  ops/ms
MinMaxVector.longReductionMin               80      N/A     N/A    2048  thrpt    8   2536.250 ±   0.343  ops/ms
MinMaxVector.longReductionMin              100      N/A     N/A    2048  thrpt    8   2536.246 ±
0.179 ops/ms ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2669613497 From coleenp at openjdk.org Wed Feb 19 20:30:34 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 20:30:34 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: References: Message-ID: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Rename isPrimitiveType field to primitive. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/3e731b9f..d08091ac Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=01-02 Stats: 11 lines in 5 files changed: 2 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From dlong at openjdk.org Wed Feb 19 21:19:58 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 21:19:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> References: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> Message-ID: <3qpqR3PC8PFmdgaIoSYA3jDWdl-oon0-AcIzXcI76rY=.38635503-c067-4f6e-a4f1-92c1b6d991d1@github.com> On Wed, 19 Feb 2025 14:19:58 GMT, 
Coleen Phillimore wrote:

> ... but not in the return since the caller likely will fetch the klass pointer next.

I notice that too. Callers are using is_primitive() to short-circuit calls to as_Klass(), which means they seem to be aware of this implementation detail when maybe they shouldn't. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962384926 From sviswanathan at openjdk.org Wed Feb 19 23:21:07 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 19 Feb 2025 23:21:07 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: <2OIYkOt8CJ-CqnQIK8sgMDtvLxJUyD5r_mKj5QT7_a8=.10b1d382-d9ae-40a1-b895-09086c80dee6@github.com> On Tue, 18 Feb 2025 02:36:13 GMT, Julian Waters wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Review comments resolutions
>
> Is anyone else getting compile failures after this was integrated? This weirdly seems to only happen on Linux
>
> * For target hotspot_variant-server_libjvm_objs_mulnode.o:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp: In member function 'virtual const Type* FmaHFNode::Value(PhaseGVN*) const':
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:1944:37: error: call of overloaded 'make(double)' is ambiguous
> 1944 | return TypeH::make(fma(f1, f2, f3));
>      |        ^
> In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/node.hpp:31,
> from /home/runner/work/jdk/jdk/src/hotspot/share/opto/addnode.hpp:28,
> from /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:26:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:544:23: note: candidate: 'static const TypeH* TypeH::make(float)'
> 544 | static const TypeH* make(float f);
>     |                     ^~~~
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:545:23: note: candidate: 'static const TypeH* TypeH::make(short int)'
> 545 | static const TypeH* make(short f);
>     |                     ^~~~

@TheShermanTanker I don't see any compile failures on Linux. Both the fastdebug and release build successfully. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2669979058 From dholmes at openjdk.org Thu Feb 20 02:52:58 2025 From: dholmes at openjdk.org (David Holmes) Date: Thu, 20 Feb 2025 02:52:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> References: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> Message-ID: On Wed, 19 Feb 2025 20:30:34 GMT, Coleen Phillimore wrote:

>> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes.
>> Tested with tier1-4 and performance tests.
>
> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision:
>
> Rename isPrimitiveType field to primitive.

src/java.base/share/classes/java/lang/Class.java line 1296:

> 1294:
> 1295: // The componentType field's null value is the sole indication that the class is an array,
> 1296: // see isArray().

Suggestion:

// The componentType field's null value is the sole indication that the class
// is an array - see isArray().

src/java.base/share/classes/java/lang/Class.java line 1297:

> 1295: // The componentType field's null value is the sole indication that the class is an array,
> 1296: // see isArray().
> 1297: private transient final Class<?> componentType;

Why the `transient` and how does this impact serialization?? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962781718 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962782083 From liach at openjdk.org Thu Feb 20 04:31:55 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 20 Feb 2025 04:31:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: References: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> Message-ID: On Thu, 20 Feb 2025 02:50:17 GMT, David Holmes wrote:

>> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Rename isPrimitiveType field to primitive.
>
> src/java.base/share/classes/java/lang/Class.java line 1297:
>
>> 1295: // The componentType field's null value is the sole indication that the class is an array,
>> 1296: // see isArray().
>> 1297: private transient final Class<?> componentType;
>
> Why the `transient` and how does this impact serialization??

The fields in `Class` are just inconsistently transient or not. `Class` has special treatment in the serialization specification, so the presence or absence of the `transient` modifier has no effect.
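As an aside, the overall pattern the PR pursues — pure-Java checks over fields the VM initializes in the class mirror, instead of native methods — can be sketched with a simplified, hypothetical model class (this is illustrative only, not the actual java.lang.Class code):

```java
// Simplified model of the pattern discussed in this thread: replace native
// Class.isArray()/isInterface()/isPrimitive() with plain-Java checks over
// fields that the JVM fills in when it creates a class mirror.
// MirrorModel and its constructor arguments are illustrative only.
final class MirrorModel {
    static final int ACC_INTERFACE = 0x0200; // JVMS class access flag

    private final int modifiers;             // written by the VM
    private final MirrorModel componentType; // non-null only for arrays
    private final boolean primitive;         // true only for primitive mirrors

    MirrorModel(int modifiers, MirrorModel componentType, boolean primitive) {
        this.modifiers = modifiers;
        this.componentType = componentType;
        this.primitive = primitive;
    }

    boolean isInterface() { return (modifiers & ACC_INTERFACE) != 0; }
    boolean isArray()     { return componentType != null; }
    boolean isPrimitive() { return primitive; }
}
```

In this shape none of the three queries needs a JNI transition; each is a trivially inlinable field check.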
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962841415 From galder at openjdk.org Thu Feb 20 06:27:57 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Thu, 20 Feb 2025 06:27:57 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Wed, 19 Feb 2025 19:50:50 GMT, Evgeny Astigeevich wrote:

>> I will run a comparison next with the same batch of tests but looking at `int` and see if there are any differences compared with `long` or not.
>
> Hi @galderz,
> Results from Graviton 3 (Neoverse-V1).
> Without the patch:
>
> Benchmark                        (probability)  (range)  (seed)  (size)   Mode  Cnt      Score     Error   Units
> MinMaxVector.intClippingRange              N/A       90       0    1000  thrpt    8  12565.427 ±  37.538  ops/ms
> MinMaxVector.intClippingRange              N/A      100       0    1000  thrpt    8  12462.072 ±  84.067  ops/ms
> MinMaxVector.intLoopMax                     50      N/A     N/A    2048  thrpt    8   5113.090 ±  68.720  ops/ms
> MinMaxVector.intLoopMax                     80      N/A     N/A    2048  thrpt    8   5129.857 ±  35.005  ops/ms
> MinMaxVector.intLoopMax                    100      N/A     N/A    2048  thrpt    8   5116.081 ±   8.946  ops/ms
> MinMaxVector.intLoopMin                     50      N/A     N/A    2048  thrpt    8   6174.544 ±  52.573  ops/ms
> MinMaxVector.intLoopMin                     80      N/A     N/A    2048  thrpt    8   6110.884 ±  54.447  ops/ms
> MinMaxVector.intLoopMin                    100      N/A     N/A    2048  thrpt    8   6178.661 ±  48.450  ops/ms
> MinMaxVector.intReductionMax                50      N/A     N/A    2048  thrpt    8   5109.270 ±  10.525  ops/ms
> MinMaxVector.intReductionMax                80      N/A     N/A    2048  thrpt    8   5123.426 ±  28.229  ops/ms
> MinMaxVector.intReductionMax               100      N/A     N/A    2048  thrpt    8   5133.799 ±   7.693  ops/ms
> MinMaxVector.intReductionMin                50      N/A     N/A    2048  thrpt    8   5130.209 ±  15.491  ops/ms
> MinMaxVector.intReductionMin                80      N/A     N/A    2048  thrpt    8   5127.823 ±  27.767  ops/ms
> MinMaxVector.intReductionMin               100      N/A     N/A    2048  thrpt    8   5118.217 ±  22.186  ops/ms
> MinMaxVector.longClippingRange             N/A       90       0    1000  thrpt    8   1831.026 ±  15.502  ops/ms
> MinMaxVector.longClippingRange             N/A      100       0    1000  thrpt    8   1827.194 ±  22.076  ops/ms
> MinMaxVector.longLoopMax                    50      N/A     N/A    2048  thrpt    8   2643.383 ±   9.830  ops/ms
> MinMaxVector.longLoopMax                    80      N/A     N/A    2048  thrpt    8   2640.417 ±   7.797  ops/ms
> MinMaxVector.longLoopMax                   100      N/A     N/A    2048  thrpt    8   1244.321 ±   1.001  ops/ms
> MinMaxVector.longLoopMin                    50      N/A     N/A    2048  thrpt    8   3239.234 ±   8.813  ops/ms
> MinMaxVector.longLoopMin                    80      N/A     N/A    2048  thrpt    8   3252.713 ± 3...

Thanks @eastig for the results on Graviton 3. I'm summarising them here:

Benchmark                        (probability)  (range)  (seed)  (size)   Mode  Cnt      Base     Patch   Units
MinMaxVector.longClippingRange             N/A       90       0    1000  thrpt    8  1831.026  5094.259  ops/ms (+178%)
MinMaxVector.longClippingRange             N/A      100       0    1000  thrpt    8  1827.194  5096.835  ops/ms (+180%)
MinMaxVector.longLoopMax                    50      N/A     N/A    2048  thrpt    8  2643.383  2636.438  ops/ms
MinMaxVector.longLoopMax                    80      N/A     N/A    2048  thrpt    8  2640.417  2644.069  ops/ms
MinMaxVector.longLoopMax                   100      N/A     N/A    2048  thrpt    8  1244.321  2646.250  ops/ms (+112%)
MinMaxVector.longLoopMin                    50      N/A     N/A    2048  thrpt    8  3239.234  2648.504  ops/ms (-18%)
MinMaxVector.longLoopMin                    80      N/A     N/A    2048  thrpt    8  3252.713  2658.082  ops/ms (-18%)
MinMaxVector.longLoopMin                   100      N/A     N/A    2048  thrpt    8  1204.370  2647.532  ops/ms (+119%)
MinMaxVector.longReductionMax               50      N/A     N/A    2048  thrpt    8  2536.322  2536.254  ops/ms
MinMaxVector.longReductionMax               80      N/A     N/A    2048  thrpt    8  2536.318  2536.209  ops/ms
MinMaxVector.longReductionMax              100      N/A     N/A    2048  thrpt    8  1395.273  2536.342  ops/ms (+81%)
MinMaxVector.longReductionMin               50      N/A     N/A    2048  thrpt    8  2536.325  2536.271  ops/ms
MinMaxVector.longReductionMin               80      N/A     N/A    2048  thrpt    8  2536.265  2536.250  ops/ms
MinMaxVector.longReductionMin              100      N/A     N/A    2048  thrpt    8  1389.982  2536.246  ops/ms (+82%)

On Graviton 3 there are wide enough registers for vectorization to kick in, so we see similar improvements to x64 AVX-512 in https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364.
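For context, the `ClippingRange` shape being measured is essentially a clamp loop. Below is a minimal sketch of that shape (an assumption about the benchmark's structure; the actual JMH source lives in the `MinMaxVector` micro):

```java
public class ClippingSketch {
    // Clamp each element into [lo, hi] using Math.min/Math.max. With the
    // MinL/MaxL intrinsics, the ternary inside min/max no longer appears as
    // control flow to SuperWord, so a loop like this can auto-vectorize.
    static long[] clip(long[] src, long lo, long hi) {
        long[] dst = new long[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = Math.min(Math.max(src[i], lo), hi);
        }
        return dst;
    }
}
```

The large `longClippingRange` gains above are consistent with this loop going from scalar to vectorized min/max.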
There is some variance in the 50/80% probability results; this was also observed, though more slightly, on x64, but on the aarch64 system it looks more pronounced. It is interesting that it happened with min but not max, but it could just be variance. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2670574593 From galder at openjdk.org Thu Feb 20 06:53:04 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Thu, 20 Feb 2025 06:53:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote:

>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>> The control flow is due to the java implementation for these methods, e.g.
>>
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after.
>> Before the patch, on darwin/aarch64 (M1):
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST                                                                      TOTAL  PASS  FAIL  ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java       1     1     0      0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST                                                                      TOTAL  PASS  FAIL  ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java       1     1     0      0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
>
> - Merge branch 'master' into topic.intrinsify-max-min-long
> - Fix typo
> - Renaming methods and variables and add docu on algorithms
> - Fix copyright years
> - Make sure it runs with cpus with either avx512 or asimd
> - Test can only run with 256 bit registers or bigger
>
>   * Remove platform dependant check and use platform independent configuration instead.
>
> - Fix license header
> - Tests should also run on aarch64 asimd=true envs
> - Added comment around the assertions
> - Adjust min/max identity IR test expectations after changes
> - ...
and 34 more: https://git.openjdk.org/jdk/compare/af7645e5...a190ae68

To follow up https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851, I've run the `MinMaxVector.int` benchmarks with **superword disabled** and with/without the `_max`/`_min` intrinsics in both AVX-512 and AVX2 modes.

# AVX-512

Benchmark                              (probability)  (range)  (seed)  (size)   Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intClippingRange                    N/A       90       0    1000  thrpt    4   1067.050   1038.640  ops/ms
MinMaxVector.intClippingRange                    N/A      100       0    1000  thrpt    4   1041.922   1039.004  ops/ms
MinMaxVector.intLoopMax                           50      N/A     N/A    2048  thrpt    4    605.173    604.337  ops/ms
MinMaxVector.intLoopMax                           80      N/A     N/A    2048  thrpt    4    605.106    604.309  ops/ms
MinMaxVector.intLoopMax                          100      N/A     N/A    2048  thrpt    4    604.547    604.432  ops/ms
MinMaxVector.intLoopMin                           50      N/A     N/A    2048  thrpt    4    495.042    605.216  ops/ms (+22%)
MinMaxVector.intLoopMin                           80      N/A     N/A    2048  thrpt    4    495.105    495.217  ops/ms
MinMaxVector.intLoopMin                          100      N/A     N/A    2048  thrpt    4    495.040    495.176  ops/ms
MinMaxVector.intReductionMultiplyMax              50      N/A     N/A    2048  thrpt    4    407.920    407.984  ops/ms
MinMaxVector.intReductionMultiplyMax              80      N/A     N/A    2048  thrpt    4    407.710    407.965  ops/ms
MinMaxVector.intReductionMultiplyMax             100      N/A     N/A    2048  thrpt    4    874.881    407.922  ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin              50      N/A     N/A    2048  thrpt    4    407.911    407.947  ops/ms
MinMaxVector.intReductionMultiplyMin              80      N/A     N/A    2048  thrpt    4    408.015    408.024  ops/ms
MinMaxVector.intReductionMultiplyMin             100      N/A     N/A    2048  thrpt    4    407.978    407.994  ops/ms
MinMaxVector.intReductionSimpleMax                50      N/A     N/A    2048  thrpt    4    460.538    460.439  ops/ms
MinMaxVector.intReductionSimpleMax                80      N/A     N/A    2048  thrpt    4    460.579    460.542  ops/ms
MinMaxVector.intReductionSimpleMax               100      N/A     N/A    2048  thrpt    4    998.211    460.404  ops/ms (-53%)
MinMaxVector.intReductionSimpleMin                50      N/A     N/A    2048  thrpt    4    460.570    460.447  ops/ms
MinMaxVector.intReductionSimpleMin                80      N/A     N/A    2048  thrpt    4    460.552    460.493  ops/ms
MinMaxVector.intReductionSimpleMin               100      N/A     N/A    2048  thrpt    4    460.455    460.485  ops/ms

There is some
improvement in `intLoopMin` @ 50%, but this didn't materialize in the `perfasm` run, so I don't think it can strictly be correlated with the use/non-use of the intrinsic. The `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% regressions with the `max` intrinsic activated are consistent with what we saw with long.

### `intReductionMultiplyMin` and `intReductionSimpleMin` @ 100% same performance

There is something very intriguing happening here, and I don't know whether it's due to min itself or to int vs long. Basically, with or without the `min` intrinsic the performance of these 2 benchmarks is the same at 100% branch probability. What is going on? Let's look at one of them:

-min

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_max,_min -XX:-UseSuperWord
...
3.04%  0x00007f49280f76e9: cmpl %edi, %r10d
3.14%  0x00007f49280f76ec: cmovgl %edi, %r10d  ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.lang.Math::min at 10 (line 2119)
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin at 23 (line 212)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub at 19 (line 124)

+min

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
...
3.10%  0x00007fbf340f6b97: cmpl %edi, %r10d
3.08%  0x00007fbf340f6b9a: cmovgl %edi, %r10d  ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin at 23 (line 212)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub at 19 (line 124)

Both are `cmov`.
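The cmp+branch vs `cmov` distinction being discussed can be sketched in plain Java (the helper names are illustrative; the branchless variant is only an analogy for what a `cmov` computes, not how `Math.min` is implemented):

```java
public class MinShapes {
    // Branchy form: typically compiles to cmp + conditional branch (or a
    // cmov when C2 decides to convert it). At 100% branch probability the
    // branch predicts perfectly, which is why non-intrinsic numbers can win.
    static long minBranchy(long a, long b) {
        return (a <= b) ? a : b;
    }

    // Branchless form: behaves like a cmov - the result is computed from
    // data, with no control flow. Caveat: the subtraction can overflow, so
    // this is only correct when a - b fits in a long; it is an illustration,
    // not a Math.min drop-in.
    static long minBranchless(long a, long b) {
        long diff = a - b;
        long mask = diff >> 63; // -1 when a < b, else 0
        return b + (diff & mask);
    }
}
```

A `cmov` trades the branch misprediction risk for a data dependency on both inputs, which is why the two shapes can swap places depending on branch probability.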
You can see how, without the intrinsic, the `Math::min` bytecode gets executed and compiled into a `cmov`, and the same happens with the intrinsic. I will verify this with long shortly to see if this behaviour is specific to the `min` operation or something to do with int vs long.

# AVX2

Here are the AVX2 numbers:

Benchmark                              (probability)  (range)  (seed)  (size)   Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intClippingRange                    N/A       90       0    1000  thrpt    4   1068.265   1039.087  ops/ms
MinMaxVector.intClippingRange                    N/A      100       0    1000  thrpt    4   1067.705   1038.760  ops/ms
MinMaxVector.intLoopMax                           50      N/A     N/A    2048  thrpt    4    605.015    604.364  ops/ms
MinMaxVector.intLoopMax                           80      N/A     N/A    2048  thrpt    4    605.169    604.366  ops/ms
MinMaxVector.intLoopMax                          100      N/A     N/A    2048  thrpt    4    604.527    604.494  ops/ms
MinMaxVector.intLoopMin                           50      N/A     N/A    2048  thrpt    4    605.099    605.057  ops/ms
MinMaxVector.intLoopMin                           80      N/A     N/A    2048  thrpt    4    495.071    605.080  ops/ms (+22%)
MinMaxVector.intLoopMin                          100      N/A     N/A    2048  thrpt    4    495.134    495.047  ops/ms
MinMaxVector.intReductionMultiplyMax              50      N/A     N/A    2048  thrpt    4    407.953    407.987  ops/ms
MinMaxVector.intReductionMultiplyMax              80      N/A     N/A    2048  thrpt    4    407.861    408.005  ops/ms
MinMaxVector.intReductionMultiplyMax             100      N/A     N/A    2048  thrpt    4    873.915    407.995  ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin              50      N/A     N/A    2048  thrpt    4    408.019    407.987  ops/ms
MinMaxVector.intReductionMultiplyMin              80      N/A     N/A    2048  thrpt    4    407.971    408.009  ops/ms
MinMaxVector.intReductionMultiplyMin             100      N/A     N/A    2048  thrpt    4    407.970    407.956  ops/ms
MinMaxVector.intReductionSimpleMax                50      N/A     N/A    2048  thrpt    4    460.443    460.514  ops/ms
MinMaxVector.intReductionSimpleMax                80      N/A     N/A    2048  thrpt    4    460.484    460.581  ops/ms
MinMaxVector.intReductionSimpleMax               100      N/A     N/A    2048  thrpt    4   1015.601    460.446  ops/ms (-54%)
MinMaxVector.intReductionSimpleMin                50      N/A     N/A    2048  thrpt    4    460.494    460.532  ops/ms
MinMaxVector.intReductionSimpleMin                80      N/A     N/A    2048  thrpt    4    460.489    460.451  ops/ms
MinMaxVector.intReductionSimpleMin               100      N/A     N/A    2048  thrpt    4   1021.420    460.435  ops/ms (-55%)
This time we see an improvement in `intLoopMin` @ 80%, but again it was not observable in the `perfasm` run. `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% have regressions, the familiar one of cmp+mov vs cmov. `intReductionMultiplyMin` @ 100% does not have a regression for the same reasons as above: both use cmov. The interesting thing is `intReductionSimpleMin` @ 100%. We see a regression there, but I didn't observe it in the `perfasm` run, so this could be down to variance in whether `cmov` is applied or not. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2670609470 From epeter at openjdk.org Thu Feb 20 07:21:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Feb 2025 07:21:45 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: Message-ID:

> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>
> **Background**
>
> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer.
>
> **Problem**
>
> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code.
>
> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
> MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
> test3(nativeUnaligned);
>
> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not!
>
> static void test3(MemorySegment ms) {
>     for (int i = 0; i < RANGE; i++) {
>         long adr = i * 4L;
>         int v = ms.get(ELEMENT_LAYOUT, adr);
>         ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
>     }
> }
>
> **Solution: Runtime Checks - Predicate and Multiversioning**
>
> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check.
>
> I came up with 2 options where to place the runtime checks:
> - A new "auto vectorization" Parse Predicate:
>   - This only works when predicates are available.
>   - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
> - Multiversion the loop:
>   - Create 2 copies of the loop (fast and slow loops).
>   - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take
>   - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code.
>   - We "stall" the `...
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

adjust selector if probability

------------- Changes:
- all: https://git.openjdk.org/jdk/pull/22016/files
- new: https://git.openjdk.org/jdk/pull/22016/files/a98ffabf..b3044bc5

Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=02
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=01-02

Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/22016.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016
PR: https://git.openjdk.org/jdk/pull/22016 From roland at openjdk.org Thu Feb 20 09:46:58 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:46:58 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 15:23:13 GMT, Emanuel Peter wrote:

> Do you see any better way than having the 2x code size if we need both a slow and fast loop?

No, but I was confused by your comment about 3x and 4x, which is why I asked for clarification. Compiled code size affects inlining decisions: if a callee has compiled code and it's larger than some threshold, then the callee is considered too expensive to inline. With your change, some method that was considered ok to inline could now be considered too big. I think that's what Vladimir is concerned by. I don't see what you can do about it, that said.
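The code-size point can be seen in a Java-level analogy of what multiversioning emits (the method names are illustrative, not C2 internals): both versions carry the full loop body, which is where the roughly 2x size comes from.

```java
public class MultiversionSketch {
    // One runtime check (the "multiversion_if") picks between a fast version
    // compiled under a speculative alignment assumption and a slow fallback
    // that makes no assumption. Both versions are semantically identical,
    // which is why duplicating them doubles the loop's code size.
    static long sum(long[] a, long baseAddress) {
        if ((baseAddress & 63) == 0) {
            return sumFast(a);  // fast_loop: may be vectorized
        }
        return sumSlow(a);      // slow_loop: scalar, but still compiled code
    }

    static long sumFast(long[] a) {
        long s = 0;
        for (long v : a) s += v; // the alignment assumption would allow vectorization here
        return s;
    }

    static long sumSlow(long[] a) {
        long s = 0;
        for (long v : a) s += v; // same semantics, no assumption
        return s;
    }
}
```

Either path produces the same answer; only the generated machine code for the two loop bodies would differ.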
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2670957288 From roland at openjdk.org Thu Feb 20 09:46:58 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:46:58 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Thu, 20 Feb 2025 09:39:59 GMT, Roland Westrelin wrote: >>> So the overhead in the final code is 2x: we can expect the fast and slow paths to be about the same size so the section of code for the loop would see its size grow by 2x. >> >> Yes, if you get to the point where you add a multi-version-if condition, i.e. where SuperWord has decided it needs a speculative assumption (here for alignment, later for aliasing), then we get the whole loop 2x. I suppose we could try to make the pre-main-post loop more complicated and just multi-version the main-loop, but that sounds much more complicated. >> >> Do you see any better way than having the 2x code size if we need both a slow and fast loop? > >> Do you see any better way than having the 2x code size if we need both a slow and fast loop? > > No but I was confused by your comment about 3x and 4x which is why I asked for clarification. > Compiled code size affects inlining decisions: if a callee has compiled code and it's larger than some threshold, then the callee is considered too expensive to inline. With your change, some method that was considered ok to inline could now be considered too big. I think that's what Vladimir is concerned by. I don't see what you can do about it, this said. > @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. 
> > In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago.

Do you understand when that happens? It doesn't feel right that the pre-loop can be lost. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2670971210 From roland at openjdk.org Thu Feb 20 09:47:01 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:47:01 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: <47tXBG3sQGZVEE5Ya2wr46CopmDjy8OClbpqagIsjgA=.6d07b495-4777-4c7e-a3b7-820f100ec2c0@github.com> References: <47tXBG3sQGZVEE5Ya2wr46CopmDjy8OClbpqagIsjgA=.6d07b495-4777-4c7e-a3b7-820f100ec2c0@github.com> Message-ID: On Tue, 18 Feb 2025 09:42:17 GMT, Emanuel Peter wrote:

>> src/hotspot/share/opto/loopUnswitch.cpp line 513:
>>
>>> 511:
>>> 512: // Create new Region.
>>> 513: RegionNode* region = new RegionNode(1);
>>
>> So we create a new `Region` every time a new condition is added?
>
> Yes. Are you ok with that? Or would you prefer if we extended an existing region (is that possible?) and then we'd have 2 cases, one where there is none yet, and one where we'd extend. I think adding one each time is easier, and it would get commoned anyway, right?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1963217281 From roland at openjdk.org Thu Feb 20 09:47:03 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:47:03 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:26:37 GMT, Roland Westrelin wrote: >> @rwestrel do you consider that a blocking issue for this PR here? > > No I filed: https://bugs.openjdk.org/browse/JDK-8350330 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1963215126 From epeter at openjdk.org Thu Feb 20 10:35:06 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Feb 2025 10:35:06 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Thu, 20 Feb 2025 09:44:16 GMT, Roland Westrelin wrote: > > @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. > > In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. > > Do you understand when that happens? It doesn't feel right that the pre loop can be lost. `VLoop::check_preconditions_helper` has a check like this: // To align vector memory accesses in the main-loop, we will have to adjust // the pre-loop limit. 
    if (_cl->is_main_loop()) {
      CountedLoopEndNode* pre_end = _cl->find_pre_loop_end();
      if (pre_end == nullptr) {
        return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT);
      }
      Node* pre_opaq1 = pre_end->limit();
      if (pre_opaq1->Opcode() != Op_Opaque1) {
        return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT);
      }
      _pre_loop_end = pre_end;
    }

I don't remember exactly why the pre-loop disappears. They are rare cases. The pre-loop somehow folds away, maybe because it only has a single iteration, or just so few that it would never take the backedge.

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2671093141

From galder at openjdk.org Thu Feb 20 10:56:58 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 20 Feb 2025 10:56:58 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID:

On Thu, 20 Feb 2025 06:50:07 GMT, Galder Zamarreño wrote:

> There is something very intriguing happening here, which I don't know whether it's due to min itself or to int vs long.
Benchmark                              (probability)  (size)  Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intReductionMultiplyMax   100            2048    thrpt 4     876.867    407.905   ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin   100            2048    thrpt 4     407.963    407.956   ops/ms (1)
MinMaxVector.longReductionMultiplyMax  100            2048    thrpt 4     838.845    405.371   ops/ms (-51%)
MinMaxVector.longReductionMultiplyMin  100            2048    thrpt 4     825.602    414.757   ops/ms (-49%)
MinMaxVector.intReductionSimpleMax     100            2048    thrpt 4    1032.561    460.486   ops/ms (-55%)
MinMaxVector.intReductionSimpleMin     100            2048    thrpt 4     460.530    460.490   ops/ms (2)
MinMaxVector.longReductionSimpleMax    100            2048    thrpt 4    1017.560    460.436   ops/ms (-54%)
MinMaxVector.longReductionSimpleMin    100            2048    thrpt 4     959.507    459.197   ops/ms (-52%)

(1) (2) It seems it's a combination of both int AND min reduction operations and disabling the intrinsic. The rest of the reduction operations seem to use cmp+mov in that situation, but not int+min, which uses cmov. Maybe this is intentional or maybe it's a bug, but it's interesting to notice.

`intReductionMultiplyMin` -min:

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_min -XX:-UseSuperWord
# Benchmark: org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMin
# Parameters: (probability = 100, size = 2048)
...
2.29%  0x00007f4aa40f5835: cmpl %edi, %r10d
4.25%  0x00007f4aa40f5838: cmovgl %edi, %r10d  ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.lang.Math::min at 10 (line 2119)
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMin at 26 (line 202)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMin_jmhTest::intReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)

`intReductionMultiplyMin` +min:

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
# Benchmark: org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMin
# Parameters: (probability = 100, size = 2048)
...
2.06%  0x00007ff8ec0f4c35: cmpl %edi, %r10d
4.31%  0x00007ff8ec0f4c38: cmovgl %edi, %r10d  ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMin at 26 (line 202)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMin_jmhTest::intReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)

`longReductionMultiplyMin` -min:

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_minL -XX:-UseSuperWord
# Benchmark: org.openjdk.bench.java.lang.MinMaxVector.longReductionMultiplyMin
# Parameters: (probability = 100, size = 2048)
...
0.01%  0x00007ff9d80f7609: imulq $0xb, 0x10(%r12, %r10, 8), %rbp  ;*lmul {reexecute=0 rethrow=0 return_oop=0}
                                                                  ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMin at 24 (line 265)
                                                                  ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMin_jmhTest::longReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)
       0x00007ff9d80f760f: testq %rbp, %rbp
       0x00007ff9d80f7612: jge 0x7ff9d80f7646  ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.lang.Math::min at 11 (line 2134)
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMin at 30 (line 266)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMin_jmhTest::longReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)

`longReductionMultiplyMin` +min:

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
# Benchmark: org.openjdk.bench.java.lang.MinMaxVector.longReductionMultiplyMin
# Parameters: (probability = 100, size = 2048)
...
0.01%  0x00007f83400f7d76: cmpq %r13, %rdx
0.12%  0x00007f83400f7d79: cmovlq %rdx, %r13  ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                              ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMin at 30 (line 266)
                                              ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMin_jmhTest::longReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2671144644

From galder at openjdk.org Thu Feb 20 11:03:58 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 20 Feb 2025 11:03:58 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID:

On Tue, 18 Feb 2025 08:43:38 GMT, Emanuel Peter wrote:

>> To make it more explicit: implementing long min/max in ad files as cmp will likely remove all the 100% regressions that are observed here. I'm going to repeat the same MinMaxVector int min/max reduction test above with the ad changes @rwestrel suggested to see what effect they have.
>
> @galderz I think we will have the same issue with both `int` and `long`: As far as I know, it is really a difficult problem to decide at compile-time if a `cmove` or `branch` is the better choice. I'm not sure there is any heuristic for which you will not find a micro-benchmark where the heuristic made the wrong choice.
>
> To my understanding, these are the factors that impact the performance:
> - `cmove` requires all inputs to complete before it can execute, and it has an inherent latency of a cycle or so itself. But you cannot have any branch mispredictions, and hence no branch misprediction penalties (i.e. when the CPU has to flush out the ops from the wrong branch and restart at the branch).
> - `branch` can hide some latencies, because we can already continue with the branch that is speculated on. We do not need to wait for the inputs of the comparison to arrive, and we can already continue with the speculated resulting value. But if the speculation is ever wrong, we have to pay the misprediction penalty.
>
> In my understanding, there are roughly 3 scenarios:
> - The branch probability is so extreme that the branch predictor would be correct almost always, and so it is profitable to do branching code.
> - The branching probability is somewhere in the middle, and the branch is not predictable. Branch mispredictions are very expensive, and so it is better to use `cmove`.
> - The branching probability is somewhere in the middle, but the branch is predictable (e.g. swaps back and forth). The branch predictor will have almost no mispredictions, and it is faster to use branching code.
>
> Modeling this precisely is actually a little complex. You would have to know the cost of the `cmove` and the `branching` version of the code. That depends on the latency of the inputs, and the outputs: does the `cmove` dramatically increase the latency on the critical path, and `branching` could hide some of that latency?
> And you would have to know how good the branch predictor is, which you cannot derive from the branching probability of our profiling (at least not when the probabilities are in the middle, and you don't know if it is a random or predictable pattern).
>
> If we can find a perfect heuristic - that would be fantastic ;)
>
> If we cannot find a perfect heuristic, then we should think about what are the most "common" or "relevant" scenarios, I think.
>
> But let's discuss all of this in a call / offline :)

FYI @eme64 @chhagedorn @rwestrel

Since we know that vectorization does not always kick in, there was a worry that scalar fallbacks would heavily suffer with the work included in this PR to add a long intrinsic for min/max. Looking at the same scenarios with int (read my comments https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851 and https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644), it looks clear that the same kind of regressions are also present there. So, if those int scalar regressions were not a problem when the int min/max intrinsic was added, I would expect the same to apply to long.

Re: https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644 - I was trying to think what could be causing this. I thought maybe it's due to the int min/max backend, which is implemented in a platform-specific way, vs the long min/max backend, which relies on platform-independent macro expansion. But if that theory was true, I would expect the same behaviour with int max vs long max, but that's not the case. It seems odd to only see this difference with min.
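Emanuel's distinction between branch *probability* and branch *predictability* can be made concrete with a small sketch (illustrative only; class and method names are made up). Both streams below take the branch exactly or almost exactly 50% of the time — which is all the profile records — yet a hardware predictor handles them very differently:

```java
import java.util.Random;

// Two branch streams with the same ~50% taken-probability but very different
// predictability. Profiling that only records the probability cannot tell
// these two cases apart, which is why a cmove-vs-branch heuristic based on
// probability alone can be wrong either way.
public class BranchPredictability {
    static int count(int[] a) {
        int taken = 0;
        for (int v : a) {
            if (v == 1) taken++;   // branch with ~0.5 probability in both cases
        }
        return taken;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] alternating = new int[n];   // predictable: 0,1,0,1,... (the "swaps back and forth" case)
        int[] random = new int[n];        // unpredictable: coin flips
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            alternating[i] = i & 1;
            random[i] = rnd.nextInt(2);
        }
        // The alternating stream is predicted almost perfectly; the random
        // stream is mispredicted roughly half the time.
        System.out.println(count(alternating) + " " + count(random));
    }
}
```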
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2671163220

From jbhateja at openjdk.org Thu Feb 20 11:37:08 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 20 Feb 2025 11:37:08 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID:

On Tue, 18 Feb 2025 02:36:13 GMT, Julian Waters wrote:

> Is anyone else getting compile failures after this was integrated? This weirdly seems to only happen on Linux
>
> ```
> * For target hotspot_variant-server_libjvm_objs_mulnode.o:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp: In member function ‘virtual const Type* FmaHFNode::Value(PhaseGVN*) const’:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:1944:37: error: call of overloaded ‘make(double)’ is ambiguous
>  1944 |     return TypeH::make(fma(f1, f2, f3));
>       |                                       ^
> In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/node.hpp:31,
>                  from /home/runner/work/jdk/jdk/src/hotspot/share/opto/addnode.hpp:28,
>                  from /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:26:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:544:23: note: candidate: ‘static const TypeH* TypeH::make(float)’
>   544 |   static const TypeH* make(float f);
>       |                       ^~~~
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:545:23: note: candidate: ‘static const TypeH* TypeH::make(short int)’
>   545 |   static const TypeH* make(short f);
>       |                       ^~~~
> ```

Hi @TheShermanTanker, please file a separate JBS issue for the errors you are observing with non-standard build options.
I am also seeing some other build issues with the following configuration --with-extra-cxxflags=-D__CORRECT_ISO_CPP11_MATH_H_PROTO_FP Best Regards, Jatin ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2671231948 From coleenp at openjdk.org Thu Feb 20 13:00:03 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 13:00:03 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: References: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> Message-ID: On Thu, 20 Feb 2025 04:29:04 GMT, Chen Liang wrote: >> src/java.base/share/classes/java/lang/Class.java line 1297: >> >>> 1295: // The componentType field's null value is the sole indication that the class is an array, >>> 1296: // see isArray(). >>> 1297: private transient final Class componentType; >> >> Why the `transient` and how does this impact serialization?? > > The fields in `Class` are just inconsistently transient or not. `Class` has special treatment in the serialization specification, so the presence or absence of the `transient` modifier has no effect. Thanks Chen. I was wondering why the other JVM installed fields were transient and this one wasn't so I added it to see if someone noticed and could verify whether it's right or not. 
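Whatever the internal field layout ends up being, the observable contract of the three methods in the PR title can be pinned down with plain reflection. A quick sanity sketch (nothing JDK-internal; `ACC_INTERFACE = 0x0200` is the class-file modifier bit the review mentions):

```java
public class ClassKindChecks {
    public static void main(String[] args) {
        // isArray(): true exactly when the class has a component type.
        System.out.println(int[].class.isArray() && int[].class.getComponentType() == int.class);
        // isInterface(): derivable from the modifier bits (ACC_INTERFACE = 0x0200).
        System.out.println(Runnable.class.isInterface()
                == ((Runnable.class.getModifiers() & 0x0200) != 0));
        // isPrimitive(): true only for the primitive mirrors, not their box types.
        System.out.println(int.class.isPrimitive() && !Integer.class.isPrimitive());
    }
}
```

Each check prints `true`, i.e. the non-native implementations discussed in the PR must preserve exactly these observable equivalences.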
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1963520059 From duke at openjdk.org Thu Feb 20 17:24:57 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Feb 2025 17:24:57 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 11 Feb 2025 10:40:31 GMT, Bhavana Kilambi wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments + some code reorganization > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2618: > >> 2616: INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >> 2617: INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >> 2618: INSN(sqdmulh,0, 0b101101, false); // accepted arrangements: T4H, T8H, T2S, T4S > > Hi, not a comment on the algorithm itself but you might have to add these new instructions in the gtest for aarch64 here - test/hotspot/gtest/aarch64/aarch64-asmtest.py and use this file to generate test/hotspot/gtest/aarch64/asmtest.out.h which would contain these newly added instructions. I have tried that, but the python script (actually the as command that it started) threw error messages: aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, 32760]. 
        prfm PLDL1KEEP, [x15, 43]
                              ^
aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        sub x1, x10, x23, sxth #2
                          ^
aarch64ops.s:359:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        add x11, x21, x5, uxtb #3
                          ^
aarch64ops.s:360:22: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        adds x11, x17, x17, uxtw #1
                            ^
aarch64ops.s:361:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        sub x11, x0, x15, uxtb #1
                          ^
aarch64ops.s:362:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        subs x7, x1, x0, sxth #2
                         ^

This is without any modifications from what is in the master branch currently.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1964049673

From duke at openjdk.org Thu Feb 20 17:33:18 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Feb 2025 17:33:18 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID:

> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled.
Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: - Accepting suggested change from Andrew Dinn - Added comments suggested by Andrew Dinn - Fixed copyright years - renaming a couple of functions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23300/files - new: https://git.openjdk.org/jdk/pull/23300/files/9a3a9444..54373d5a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=04-05 Stats: 98 lines in 6 files changed: 2 ins; 0 del; 96 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From coleenp at openjdk.org Thu Feb 20 20:11:11 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 20:11:11 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v4] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. 
Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Update src/java.base/share/classes/java/lang/Class.java Co-authored-by: David Holmes <62092539+dholmes-ora at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/d08091ac..7a4c595b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Thu Feb 20 20:19:15 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 20:19:15 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. 
Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Fix whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/7a4c595b..02347433 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From vlivanov at openjdk.org Thu Feb 20 21:56:55 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 20 Feb 2025 21:56:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 20:19:15 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace Looks good! Regarding @IntrinsicCandidate and its effects on JIT-compiler inlining decisions, @ForceInline could be added, but IMO it's not necessary since new implementations are small. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2631244815 From coleenp at openjdk.org Thu Feb 20 23:25:57 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 23:25:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: <3qpqR3PC8PFmdgaIoSYA3jDWdl-oon0-AcIzXcI76rY=.38635503-c067-4f6e-a4f1-92c1b6d991d1@github.com> References: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> <3qpqR3PC8PFmdgaIoSYA3jDWdl-oon0-AcIzXcI76rY=.38635503-c067-4f6e-a4f1-92c1b6d991d1@github.com> Message-ID: On Wed, 19 Feb 2025 21:16:51 GMT, Dean Long wrote: >> This is a good question. The heapwalker walks through dead mirrors so I can't assert that a null klass field matches our boolean setting but I don't know why this never asserts (can't find any instances in the bug database) but it seems like it could. I'll use the bool field in the mirror in the assert though but not in the return since the caller likely will fetch the klass pointer next. > >> ... but not in the return since the caller likely will fetch the klass pointer next. > > I notice that too. Callers are using is_primitive() to short-circuit calls to as_Klass(), which means they seem to be aware of this implementation detail when maybe they shouldn't. There are 136 callers so yes, it might be something that shouldn't be known in this many places. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1964492501 From coleenp at openjdk.org Thu Feb 20 23:31:55 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 23:31:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 20:19:15 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace Thanks Vladimir for review and for answering my earlier questions on this change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2672941007 From liach at openjdk.org Thu Feb 20 23:40:55 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 20 Feb 2025 23:40:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 20:19:15 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace You are right, using the field directly is indeed better. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1964502825

From epeter at openjdk.org Fri Feb 21 07:04:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 07:04:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID:

On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote:

>> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there.
>>
>> Does that sound ok?
>>
>>> Can we profile alignment in Interpreter (and C1)?
>>
>> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it.
>>
>> What do you think?
> >> > Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > > You should not worry about `-Xcomp` it is testing flag - we can use some default there. > I am fine if you think profiling will not bring us much benefits. Note, I am not asking create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such case we may skip predicate and generate multi versions loop during compilation. On other hand, we may have unaligned access only during startup and not later when we compile method. Anyway, it does not affect these changes. > > I will look on changes more later. @vnkozlov I made the change with the probability `PROB_FAIR` -> `PROB_LIKELY_MAG(3)` and ran testing again. @rwestrel Do you want me to find examples for the pre-loop disappearing, I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. 
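The balanced-chain idea quoted in this thread — give each of `n` chained conditions probability `pow(0.5, 1/n)` so that their product lands on the 0.5 target — is easy to check numerically (illustrative snippet; the class name is made up):

```java
import java.util.Locale;

// Verifies that assigning each of n chained conditions the probability
// pow(target, 1/n) makes the combined probability of the whole chain equal
// to the target, regardless of n.
public class ChainProbability {
    public static void main(String[] args) {
        double target = 0.5;  // desired combined probability of the chain
        for (int n = 1; n <= 4; n++) {
            double per = Math.pow(target, 1.0 / n);  // probability given to each condition
            double product = Math.pow(per, n);       // combined probability of n conditions
            System.out.printf(Locale.ROOT, "n=%d per=%.4f product=%.4f%n", n, per, product);
        }
    }
}
```

For n=2 this reproduces the `sqrt(0.5) ≈ 0.7071` per-condition probability mentioned above, with the product staying at 0.5 for every chain length.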
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2673745463 From epeter at openjdk.org Fri Feb 21 08:22:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 08:22:59 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FF eVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Thu, 20 Feb 2025 11:00:59 GMT, Galder Zamarre?o wrote: > So, if those int scalar regressions were not a problem when int min/max intrinsic was added, I would expect the same to apply to long. Do you know when they were added? If that was a long time ago, we might not have noticed back then, but we might notice now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2673875104 From epeter at openjdk.org Fri Feb 21 08:23:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 08:23:00 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Thu, 20 Feb 2025 06:50:07 GMT, Galder Zamarre?o wrote: > The interesting thing is intReductionSimpleMin @ 100%. We see a regression there but I didn't observe it with the perfasm run. So, this could be due to variance in the application of cmov or not? I don't see the error / variance in the results you posted. 
Often I look at those, and if it is anywhere above 10% of the average, then I'm suspicious ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2673879859 From epeter at openjdk.org Fri Feb 21 08:30:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 08:30:00 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FF eVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Thu, 20 Feb 2025 11:00:59 GMT, Galder Zamarre?o wrote: > Re: https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644 - I was trying to think what could be causing this. Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2673892612 From duke at openjdk.org Fri Feb 21 10:09:56 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 21 Feb 2025 10:09:56 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: <3kiI1J7jcczgzTRi9HZztzhGe1blcy8Ga11xoGhzueY=.98543172-5b38-4199-bead-0988de0e0e75@github.com> References: <3kiI1J7jcczgzTRi9HZztzhGe1blcy8Ga11xoGhzueY=.98543172-5b38-4199-bead-0988de0e0e75@github.com> Message-ID: On Tue, 18 Feb 2025 13:33:52 GMT, Andrew Dinn wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments + some code reorganization > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2594: > >> 2592: guarantee(T != T1Q && T != T1D, "incorrect arrangement"); \ >> 2593: if (!acceptT2D) guarantee(T != T2D, "incorrect arrangement"); \ >> 2594: if (strcmp(#NAME, "sqdmulh") == 0) guarantee(T != T8B && T != T16B, "incorrect arrangement"); \ > > Suggestion: > > I think it might be better to change this test from a strcmp call to (opc2 == 0b101101). The strcmp test is clearer to a reader of the code but the call may not be guaranteed to be compiled out at build time while the latter will. Changed as suggested. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1965215153 From duke at openjdk.org Fri Feb 21 10:14:00 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 21 Feb 2025 10:14:00 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 13:43:18 GMT, Andrew Dinn wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments + some code reorganization > > src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4066: > >> 4064: } >> 4065: >> 4066: // Execute on round of keccak of two computations in parallel. > > Suggestion: > > It would be helpful to add comments that relate the register and instruction selection to the original Java source code. e.g. change the header as follows > > // Performs 2 keccak round transformations using vector parallelism > // > // Two sets of 25 * 64-bit input states a0[lo:hi]...a24[lo:hi] are passed in > // the lower/upper halves of registers v0...v24 and the transformed states > // are returned in the same registers. Intermediate 64-bit pairs > // c0...c5 and d0...d5 are computed in registers v25...v30. v31 is > // loaded with the required pair of 64 bit rounding constants. > // During computation of the output states some intermediate results are > // shuffled around registers v0...v30. Comments on each line indicate > // how the values in registers correspond to variables ai, ci, di in > // the Java source code, likewise how the generated machine instructions > // correspond to Java source operations (n.b. rol means rotate left). 
> > Then annotate the generation steps as follows: > > __ eor3(v29, __ T16B, v4, v9, v14); // c4 = a4 ^ a9 ^ a14 > __ eor3(v26, __ T16B, v1, v6, v11); // c1 = a1 ^ a6 ^ a11 > __ eor3(v28, __ T16B, v3, v8, v13); // c3 = a3 ^ a8 ^ a13 > __ eor3(v25, __ T16B, v0, v5, v10); // c0 = a0 ^ a5 ^ a10 > __ eor3(v27, __ T16B, v2, v7, v12); // c2 = a2 ^ a7 ^ a12 > __ eor3(v29, __ T16B, v29, v19, v24); // c4 ^= a19 ^ a24 > __ eor3(v26, __ T16B, v26, v16, v21); // c1 ^= a16 ^ a21 > __ eor3(v28, __ T16B, v28, v18, v23); // c3 ^= a18 ^ a23 > __ eor3(v25, __ T16B, v25, v15, v20); // c0 ^= a15 ^ a20 > __ eor3(v27, __ T16B, v27, v17, v22); // c2 ^= a17 ^ a22 > > __ rax1(v30, __ T2D, v29, v26); // d0 = c4 ^ rol(c1, 1) > __ rax1(v26, __ T2D, v26, v28); // d2 = c1 ^ rol(c3, 1) > __ rax1(v28, __ T2D, v28, v25); // d4 = c3 ^ rol(c0, 1) > __ rax1(v25, __ T2D, v25, v27); // d1 = c0 ^ rol(c2, 1) > __ rax1(v27, __ T2D, v27, v29); // d3 = c2 ^ rol(c4, 1) > > __ eor(v0, __ T16B, v0, v30); // a0 = a0 ^ d0 > __ xar(v29, __ T2D, v1, v25, (64 - 1)); // a10' = rol((a1^d1), 1) > __ xar(v1, __ T2D, v6, v25, (64 - 44)); // a1 = rol((a6^d1), 44) > __ xar(v6, __ T2D, v9, v28, (64 - 20)); // a6 = rol((a9^d4), 20) > __ xar(v... Although this piece of code is not new, and I don't really think that this level of commenting is necessary, especially in code that is very unlikely to change, I added the comments. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1965220606 From duke at openjdk.org Fri Feb 21 10:25:59 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 21 Feb 2025 10:25:59 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 02:55:18 GMT, Hao Sun wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments + some code reorganization > > Hi. Here is the test result of our CI. 
> > ### copyright year > > the following files should update the copyright year to 2025. > > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp > src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp > src/hotspot/share/runtime/globals.hpp > src/java.base/share/classes/sun/security/provider/ML_DSA.java > src/java.base/share/classes/sun/security/provider/SHA3Parallel.java > test/micro/org/openjdk/bench/java/security/MLDSA.java > > > ### cross-build failure > > Cross build for riscv64/s390/ppc64 failed. > > Here shows the error msg for ppc64 > > > === Output from failing command(s) repeated here === > * For target support_interim-jmods_support__create_java.base.jmod_exec: > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769 > # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0 > # > # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3) > # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) > # Problematic frame: > # V [libjvm.so+0x3b391c] Instruction_aarch64::~Instruction_aarch64()+0xbc > # > # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/ > # > # An error report file with more information is saved as: > # /tmp/jdk-src/make/hs_err_pid72752.log > ... (rest of output omitted) > > * All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs. 
> === End of repeated output === > > > I suppose we should make the similar update at file `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` to other platforms @shqking, I changed the copyright years, but I don't really understand how the aarch64-specific code can overflow buffers on other architectures. As far as I understand, Instruction_aarch64 should not have been there in a ppc build. Was this a build attempted on an aarch64 for the other architectures? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2674156680 From yzheng at openjdk.org Fri Feb 21 12:14:57 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Fri, 21 Feb 2025 12:14:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 20:19:15 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace LGTM! As @iwanowww said, not inlining such trivial methods seems more like an inliner bug/enhancement opportunity. ------------- Marked as reviewed by yzheng (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2632877796 From coleenp at openjdk.org Fri Feb 21 12:31:46 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 21 Feb 2025 12:31:46 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v6] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Remove JVM_GetClassModifiers from jvm.h too. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/02347433..c23718b3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=04-05 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Fri Feb 21 12:31:48 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 21 Feb 2025 12:31:48 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v6] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 14:21:47 GMT, Coleen Phillimore wrote: >> src/hotspot/share/prims/jvm.cpp line 1262: >> >>> 1260: JVM_END >>> 1261: >>> 1262: JVM_ENTRY(jboolean, JVM_IsArrayClass(JNIEnv *env, jclass cls)) >> >> Where are the changes to jvm.h? > > Good catch, I also removed JVM_GetProtectionDomain. and JVM_GetClassModifiers. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1965401052 From coleenp at openjdk.org Fri Feb 21 12:31:49 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 21 Feb 2025 12:31:49 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> On Thu, 20 Feb 2025 23:38:34 GMT, Chen Liang wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix whitespace > > You are right, using the field directly is indeed better. I don't use the field directly because the field is a short and getModifiers makes it into Modifier. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1965399996 From liach at openjdk.org Fri Feb 21 14:04:02 2025 From: liach at openjdk.org (Chen Liang) Date: Fri, 21 Feb 2025 14:04:02 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> References: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> Message-ID: On Fri, 21 Feb 2025 12:27:56 GMT, Coleen Phillimore wrote: >> You are right, using the field directly is indeed better. > > I don't use the field directly because the field is a short and getModifiers makes it into Modifier. Indeed, even though this checks for the specific bit so widening has no effect, it is better to be cautious here. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1965522767 From kvn at openjdk.org Fri Feb 21 19:08:01 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Feb 2025 19:08:01 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 07:21:45 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. 
But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > adjust selector if probability How profitable (performance wise) to optimize slow path loop? Can we skip any optimizations for it - treat it as not-Counted? src/hotspot/share/opto/loopTransform.cpp line 3363: > 3361: if (cl->is_pre_loop() || cl->is_post_loop()) return true; > 3362: > 3363: // If we are stalled, check if we can get unstalled. Can you expand comment explaining cases when we "stall" and what it means? src/hotspot/share/opto/loopopts.cpp line 4514: > 4512: // and then rejecting the slow_loop by constant folding the multiversion_if. > 4513: // > 4514: // Therefore, we "stall" the optimization of the slow_loop until we add We don't use "stall" term. We use "delay" - this is what happens here if I understand it correctly. src/hotspot/share/opto/loopopts.cpp line 4520: > 4518: // multiversion_if folds away the "stalled" slow_loop. 
If we add any > 4519: // speculative assumption, then we mark the OpaqueMultiversioningNode > 4520: // with "unstall_slow_loop", so that the slow_loop can be optimized. "unstall_slow_loop" - > "optimize_slow_loop" ------------- PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2633960596 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1966019182 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1966028103 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1966032230 From dlong at openjdk.org Fri Feb 21 21:10:58 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 21 Feb 2025 21:10:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> Message-ID: On Fri, 21 Feb 2025 14:01:20 GMT, Chen Liang wrote: >> I don't use the field directly because the field is a short and getModifiers makes it into Modifier. > > Indeed, even though this checks for the specific bit so widening has no effect, it is better to be cautious here. > I don't use the field directly because the field is a short and getModifiers makes it into Modifier. But getModifiers() returns `int`, not `Modifier` (which is all static). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1966170358 From coleenp at openjdk.org Sat Feb 22 14:49:38 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Sat, 22 Feb 2025 14:49:38 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. 
> Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Use modifiers field directly in isInterface. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/c23718b3..db7c9782 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=05-06 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Sat Feb 22 14:49:38 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Sat, 22 Feb 2025 14:49:38 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> Message-ID: On Fri, 21 Feb 2025 21:08:33 GMT, Dean Long wrote: >> Indeed, even though this checks for the specific bit so widening has no effect, it is better to be cautious here. > >> I don't use the field directly because the field is a short and getModifiers makes it into Modifier. > > But getModifiers() returns `int`, not `Modifier` (which is all static). I mis-remembered why I called getModifiers(), maybe because of all the other calls to getModifiers() in Class.java which used to be needed, but I did want to call Modifier.isInterface(). If using the 'modifiers' field directly is better, I'll change it to that. 
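The approach being discussed — checking the modifier bit directly rather than going through getModifiers() — can be sketched as below. The field names and types here are illustrative stand-ins, not the actual java.lang.Class internals:

```java
import java.lang.reflect.Modifier;

// Hypothetical sketch of the non-native checks from JDK-8349860.
// In the real Class, 'modifiers' and 'componentType' are initialized
// by the VM; the names and layout here are assumptions for illustration.
class MirrorSketch {
    private final char modifiers;        // narrow field, widened to int for the bit test
    private final Object componentType;  // non-null only for array classes
    private final boolean isPrimitive;   // the new final transient boolean in the proposal

    MirrorSketch(int mods, Object component, boolean primitive) {
        this.modifiers = (char) mods;
        this.componentType = component;
        this.isPrimitive = primitive;
    }

    boolean isInterface() { return Modifier.isInterface(modifiers); }
    boolean isArray()     { return componentType != null; }
    boolean isPrimitive() { return isPrimitive; }

    public static void main(String[] args) {
        MirrorSketch iface = new MirrorSketch(Modifier.INTERFACE, null, false);
        MirrorSketch array = new MirrorSketch(0, int.class, false);
        System.out.println(iface.isInterface() && array.isArray() && !array.isPrimitive());
    }
}
```

As the thread notes, widening the narrow field to int does not affect the single-bit test, but calling the existing Modifier helper keeps the intent explicit.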
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1966527692 From epeter at openjdk.org Mon Feb 24 07:25:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 07:25:59 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: >> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesti ng benchmark results there. >> >> Does that sound ok? >> >>> Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? 
> >> > Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > > You should not worry about `-Xcomp` it is testing flag - we can use some default there. > I am fine if you think profiling will not bring us much benefits. Note, I am not asking create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such case we may skip predicate and generate multi versions loop during compilation. On other hand, we may have unaligned access only during startup and not later when we compile method. Anyway, it does not affect these changes. > > I will look on changes more later. @vnkozlov I'll think about the "stall" vs "delay" suggestion. > How profitable (performance wise) to optimize slow path loop? Can we skip any optimizations for it - treat it as not-Counted? I suppose that depends on if the slow path loop will be taken. Imagine we are working on some unaligned MemorySegment (or with aliasing runtime-checks failing). In these cases without optimizing we would for example not unroll. But unrolling can give quite the speedup, of course at the cost of more compile time and code size. Also some RangeCheck eliminations only happen if you have a pre-main-post loop structure. There are probably other optimizations as well. So yes, if the slow path loop is taken often, then optimizing is probably worth it. What do you think? 
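The fast/slow multiversioning shape described above can be sketched in plain Java. This is purely illustrative; in C2 the split happens at the IR level with a multiversion_if guarding a speculatively-optimized loop copy:

```java
// Illustrative sketch of loop multiversioning: a runtime alignment
// check selects a "fast" loop (which may assume aligned accesses and
// vectorize) or a "slow" fallback that makes no such assumption.
class MultiversionSketch {
    static long sum(long base, int[] data) {
        if ((base & 7) == 0) {       // speculative check: base 8-byte aligned?
            return fastLoop(data);   // may use aligned vector loads/stores
        } else {
            return slowLoop(data);   // no alignment assumption; still unrollable
        }
    }

    static long fastLoop(int[] d) { long s = 0; for (int v : d) s += v; return s; }
    static long slowLoop(int[] d) { long s = 0; for (int v : d) s += v; return s; }

    public static void main(String[] args) {
        int[] d = {1, 2, 3};
        System.out.println(sum(16, d) + " " + sum(17, d));
    }
}
```

Both versions compute the same result; the point debated above is how much optimization effort (unrolling, range-check elimination) the slow copy should still receive.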
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2677607527 From haosun at openjdk.org Mon Feb 24 07:44:55 2025 From: haosun at openjdk.org (Hao Sun) Date: Mon, 24 Feb 2025 07:44:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 10:23:37 GMT, Ferenc Rakoczi wrote: > Was this a build attempted on an aarch64 for the other architectures? Yes. It's a cross-build on AArch64 for other architectures. > Instruction_aarch64 should not have been there in a ppc build Oops. I didn't check the error message carefully. It might be some issue in our CI. I will check that. Sorry for the noise. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2677637524 From epeter at openjdk.org Mon Feb 24 08:03:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 08:03:59 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: >> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. 
The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesti ng benchmark results there. >> >> Does that sound ok? >> >>> Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > >> > Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > > You should not worry about `-Xcomp` it is testing flag - we can use some default there. > I am fine if you think profiling will not bring us much benefits. Note, I am not asking create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such case we may skip predicate and generate multi versions loop during compilation. On other hand, we may have unaligned access only during startup and not later when we compile method. Anyway, it does not affect these changes. > > I will look on changes more later. 
@vnkozlov I mean the issue this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2677667789 From adinn at openjdk.org Mon Feb 24 08:39:54 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 08:39:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 07:41:58 GMT, Hao Sun wrote: >> @shqking, I changed the copyright years, but I don't really understand how the aarch64-specific code can overflow buffers on other architectures. As far as I understand, Instruction_aarch64 should not have been there in a ppc build. >> Was this a build attempted on an aarch64 for the other architectures? > >> Was this a build attempted on an aarch64 for the other architectures? > > Yes. It's a cross-build on AArch64 for other architectures. > >> Instruction_aarch64 should not have been there in a ppc build > > Oops. I didn't check the error message carefully. It might be some issue in our CI. I will check that. > > Sorry for the noise. @shqking There is a [known issue](https://bugs.openjdk.org/browse/JDK-8349921) with cross-builds that is still being investigated. I think that may explain the problem you are seeing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2677735964 From bkilambi at openjdk.org Mon Feb 24 09:37:54 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 24 Feb 2025 09:37:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Thu, 20 Feb 2025 17:22:25 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2618: >> >>> 2616: INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >>> 2617: INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >>> 2618: INSN(sqdmulh,0, 0b101101, false); // accepted arrangements: T4H, T8H, T2S, T4S >> >> Hi, not a comment on the algorithm itself but you might have to add these new instructions in the gtest for aarch64 here - test/hotspot/gtest/aarch64/aarch64-asmtest.py and use this file to generate test/hotspot/gtest/aarch64/asmtest.out.h which would contain these newly added instructions. > > I have tried that, but the python script (actually the as command that it started) threw error messages: > > aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, 32760]. 
> prfm PLDL1KEEP, [x15, 43] > ^ > aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > ^ > aarch64ops.s:359:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > add x11, x21, x5, uxtb #3 > ^ > aarch64ops.s:360:22: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > adds x11, x17, x17, uxtw #1 > ^ > aarch64ops.s:361:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x11, x0, x15, uxtb #1 > ^ > aarch64ops.s:362:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > subs x7, x1, x0, sxth #2 > ^ > This is without any modifications from what is in the master branch currently. You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like - `python aarch64-asmtest.py | expand > asmtest.out.h`. Please let me know if you still face problems. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967284270 From adinn at openjdk.org Mon Feb 24 11:50:55 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 11:50:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4593: > 4591: // chunks of) vector registers v30 and v31, resp. > 4592: // The inputs are in v0-v7 and v16-v23 and the results go to v16-v23, > 4593: // four 32-bit values in each register Suggestion: Once again it would be good to annotate the lines in this code with comments that relate the generated code back to the original Java code. In the header comment you should refer to the relevant Java class and the var names there: // computes (in parallel across 8 x 4S vectors) // a = b * c * 2^-32 mod MONT_Q // where // inputs b and c are in v0, ..., v7 and v16, ... v23, // scratch registers v24, ... v27 are clobbered // output a is written back into v16, ... v23 // constants q and q_inv are in v30, v31 // // See the equivalent Java code in method ML_DSA.montMul Then comment the generation lines as shown below ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967490923 From adinn at openjdk.org Mon Feb 24 11:53:55 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 11:53:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4604: > 4602: FloatRegister vr7 = by_constant ? v29 : v7; > 4603: > 4604: __ sqdmulh(v24, __ T4S, vr0, v16); + __ sqdmulh(v24, __ T4S, v0, v16); // aHigh = hi32(2 * b * c) + __ mulv(v16, __ T4S, v0, v16); // aLow = lo32(b * c) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4613: > 4611: __ mulv(v19, __ T4S, vr3, v19); > 4612: > 4613: __ mulv(v16, __ T4S, v16, v30); __ mulv(v16, __ T4S, v16, v30); // m = aLow * qinv src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4618: > 4616: __ mulv(v19, __ T4S, v19, v30); > 4617: > 4618: __ sqdmulh(v16, __ T4S, v16, v31); __ sqdmulh(v16, __ T4S, v16, v31); // n = hi32(2 * m * q) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4623: > 4621: __ sqdmulh(v19, __ T4S, v19, v31); > 4622: > 4623: __ shsubv(v16, __ T4S, v24, v16); __ shsubv(v16, __ T4S, v24, v16); // a = (aHigh - n) / 2 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967491928 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967492635 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967493031 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967493643 From bkilambi at openjdk.org Mon Feb 24 12:14:03 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 24 Feb 2025 12:14:03 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 operations Message-ID: This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. 
------------- Commit messages: - 8345125: Aarch64: Add aarch64 backend for Float16 operations Changes: https://git.openjdk.org/jdk/pull/23748/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23748&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8345125 Stats: 1007 lines in 13 files changed: 326 ins; 1 del; 680 mod Patch: https://git.openjdk.org/jdk/pull/23748.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23748/head:pull/23748 PR: https://git.openjdk.org/jdk/pull/23748 From roland at openjdk.org Mon Feb 24 12:54:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 24 Feb 2025 12:54:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Thu, 20 Feb 2025 09:44:16 GMT, Roland Westrelin wrote: >>> Do you see any better way than having the 2x code size if we need both a slow and fast loop? >> >> No but I was confused by your comment about 3x and 4x which is why I asked for clarification. >> Compiled code size affects inlining decisions: if a callee has compiled code and it's larger than some threshold, then the callee is considered too expensive to inline. With your change, some method that was considered ok to inline could now be considered too big. I think that's what Vladimir is concerned by. I don't see what you can do about it, this said. > >> @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. >> >> In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. 
> > Do you understand when that happens? It doesn't feel right that the pre loop can be lost. > @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. Yes, if not too much work. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678332801 From epeter at openjdk.org Mon Feb 24 14:32:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 14:32:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Mon, 24 Feb 2025 12:52:42 GMT, Roland Westrelin wrote: > > @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. > > Yes, if not too much work. Ok, let's add this: diff --git a/src/hotspot/share/opto/vectorization.cpp b/src/hotspot/share/opto/vectorization.cpp index e607a1065dd..290ee249a42 100644 --- a/src/hotspot/share/opto/vectorization.cpp +++ b/src/hotspot/share/opto/vectorization.cpp @@ -98,6 +98,7 @@ VStatus VLoop::check_preconditions_helper() { // the pre-loop limit. CountedLoopEndNode* pre_end = _cl->find_pre_loop_end(); if (pre_end == nullptr) { + assert(false, "found no pre-loop"); return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT); } Node* pre_opaq1 = pre_end->limit(); And run that: rr /oracle-work/jdk-fork7/build/linux-x64-slowdebug/jdk/bin/java -Xcomp -XX:+TraceLoopOpts -XX:CompileCommand=compileonly,jdk.internal.classfile.impl.StackMapGenerator::processBlock --version .... 
PreMainPost Loop: N7127/N4014 limit_check profile_predicated predicated counted [0,int),+1 (2147483648 iters) rc has_sfpt strip_mined Unroll 2 Loop: N7127/N4014 counted [int,int),+1 (2147483648 iters) main rc has_sfpt strip_mined Loop: N0/N0 has_call has_sfpt Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre rc has_sfpt Loop: N7126/N7125 sfpts={ 7128 } Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main rc has_sfpt strip_mined Loop: N7409/N7416 counted [int,int),+1 (4 iters) post rc has_sfpt Parallel IV: 7728 Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre has_sfpt Parallel IV: 7725 Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main has_sfpt strip_mined Parallel IV: 7718 Loop: N7409/N7416 counted [int,int),+1 (4 iters) post has_sfpt Loop: N0/N0 has_call has_sfpt Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre has_sfpt Loop: N7126/N7125 sfpts={ 7128 } Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main has_sfpt strip_mined Loop: N7409/N7416 counted [int,int),+1 (4 iters) post has_sfpt RangeCheck Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main has_sfpt rce strip_mined Unroll 4 Loop: N7508/N4014 limit_check counted [int,int),+2 (2147483648 iters) main has_sfpt rce strip_mined Loop: N0/N0 has_call has_sfpt Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre rc has_sfpt Loop: N7126/N7125 limit_check sfpts={ 7128 } Loop: N8146/N4014 limit_check counted [int,int),+4 (2147483648 iters) main has_sfpt strip_mined Loop: N7409/N7416 counted [int,int),+1 (4 iters) post rc has_sfpt ... # Internal Error (/oracle-work/jdk-fork7/open/src/hotspot/share/opto/vectorization.cpp:101), pid=1381339, tid=1381348 # assert(false) failed: found no pre-loop The pre-loop node is not dead actually. The issue is with the main-loop in `CountedLoopNode::is_canonical_loop_entry`. 
We skip through some predicates, but then we cannot find the ZeroTripGuard, rather I'm seeing this: (rr) p ctrl->dump_bfs(2,0,"#cd") dist dump --------------------------------------------- 2 974 ConI === 0 [[ ... ]] #int:1 2 8060 IfTrue === 8056 [[ 8073 ]] #1 1 8073 If === 8060 974 [[ 8074 8077 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 0 8077 IfTrue === 8073 [[ 8103 ]] #1 The pre-loop is further up though: (rr) p this->dump_bfs(26,0,"#c") dist dump --------------------------------------------- 26 7453 CountedLoop === 7453 4015 7460 [[ 7452 7453 7454 7455 ]] inner stride: 1 pre of N7127 !orig=[7127],[7118],[2645] !jvms: StackMapGenerator::processBlock @ bci:2677 (line 671) 25 7455 If === 7453 7441 [[ 7456 7464 ]] P=0.000001, C=-1.000000 !orig=[2686] !jvms: StackMapGenerator$Frame::popStack @ bci:5 (line 1001) StackMapGenerator::processBlock @ bci:2681 (line 671) 24 7456 IfFalse === 7455 [[ 7448 7457 ]] #0 !orig=[2631],[2628] !jvms: StackMapGenerator$Frame::popStack @ bci:5 (line 1001) StackMapGenerator::processBlock @ bci:2681 (line 671) 23 7457 RangeCheck === 7456 7446 [[ 7458 7467 ]] P=0.999999, C=-1.000000 !orig=[1189] !jvms: StackMapGenerator$Frame::popStack @ bci:33 (line 1002) StackMapGenerator::processBlock @ bci:2681 (line 671) 22 7458 IfTrue === 7457 [[ 7459 ]] #1 !orig=[777],385 !jvms: StackMapGenerator$Frame::popStack @ bci:33 (line 1002) StackMapGenerator::processBlock @ bci:2681 (line 671) 21 7459 CountedLoopEnd === 7458 7443 [[ 7460 7482 ]] [lt] P=0.900000, C=-1.000000 !orig=7122,[5398] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 20 7482 IfFalse === 7459 [[ 7486 ]] #0 19 7486 If === 7482 7485 [[ 7461 7487 ]] P=0.999999, C=-1.000000 18 7487 IfTrue === 7486 [[ 7977 ]] #1 17 7977 If === 7487 974 [[ 7978 7981 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 16 7981 IfTrue === 7977 [[ 7994 ]] #1 15 7994 If === 7981 974 [[ 7995 7998 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 14 7998 IfTrue === 7994 
[[ 8118 ]] #1 13 8118 If === 7998 8117 [[ 8119 8122 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 12 8122 IfTrue === 8118 [[ 8007 ]] #1 11 8007 If === 8122 8006 [[ 8008 8011 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 10 8011 IfTrue === 8007 [[ 8056 ]] #1 9 8056 If === 8011 974 [[ 8057 8060 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 8 8060 IfTrue === 8056 [[ 8073 ]] #1 7 8073 If === 8060 974 [[ 8074 8077 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 6 8077 IfTrue === 8073 [[ 8103 ]] #1 5 8173 IfFalse === 7122 [[ 7128 7129 ]] #0 !orig=[7524],[7123],[5442] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 5 8103 If === 8077 8102 [[ 8104 8107 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 4 7128 SafePoint === 8173 1 778 1 1 7129 780 1 1 781 781 782 783 784 1 1 1 785 786 [[ 7124 ]] SafePoint !orig=385 !jvms: StackMapGenerator::processBlock @ bci:2688 (line 670) 4 8107 IfTrue === 8103 [[ 8086 ]] #1 3 7124 OuterStripMinedLoopEnd === 7128 781 [[ 7125 7471 ]] P=0.900000, C=-1.000000 3 8086 If === 8107 8085 [[ 8087 8090 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 2 7122 CountedLoopEnd === 8146 7121 [[ 8173 4014 ]] [lt] P=0.900000, C=-1.000000 !orig=[5398] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 2 7125 IfTrue === 7124 [[ 7126 ]] #1 2 8090 IfTrue === 8086 [[ 7126 ]] #1 1 4014 IfTrue === 7122 [[ 8146 ]] #1 !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 1 7126 OuterStripMinedLoop === 7126 8090 7125 [[ 7126 8146 ]] 0 8146 CountedLoop === 8146 7126 4014 [[ 8146 1191 8157 8158 7122 7503 ]] inner stride: 4 main of N8146 strip mined !orig=[7508],[7127],[7118],[2645] !jvms: StackMapGenerator::processBlock @ bci:2677 (line 671) It looks like we are skipping some predicates, but not enough of them maybe? In `AssertionPredicates::find_entry` we see: - `8090 IfTrue === 8086 [[ 7126 ]] #1`: `is_predicate` returns `true`. 
- `8107 IfTrue === 8103 [[ 8086 ]] #1`: `is_predicate` returns `true`. - `8077 IfTrue === 8073 [[ 8103 ]] #1`: `is_predicate` returns `false`. The reason is that the assertion predicate Opaque nodes have already disappeared. I talked with @chhagedorn and he says that there are some "dying" initialized assertion predicates from unrolling that can be in the way. They would be cleaned out by IGVN later, and then we can see through. But at this point they are in the way, so we cannot see through and find the ZeroTripGuard; the predicate iterator is not good enough yet. But @chhagedorn is working on that. https://bugs.openjdk.org/browse/JDK-8350579 The implication is that the ZeroTripGuard can temporarily not be found, and so we cannot even find the pre-loop, nor the multiversion-if. So I cannot really add an assert now. And who knows, there may be other blocking reasons on top of that. @rwestrel Does that make sense? What do you think we should do? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678602660 From adinn at openjdk.org Mon Feb 24 14:58:57 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 14:58:57 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: <_ApJlty8yCwyY8FiRhczpoKGf1G83hvMuXvOWeKHb90=.5758138f-b03b-49be-ab7a-3b4b56cbe7a6@github.com> On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4654: > 4652: > 4653: void dilithium_add_sub32() { > 4654: __ addv(v24, __ T4S, v0, v16); __ addv(v24, __ T4S, v0, v16); // a0 = b + c src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4663: > 4661: __ addv(v31, __ T4S, v7, v23); > 4662: > 4663: __ subv(v0, __ T4S, v0, v16); __ subv(v0, __ T4S, v0, v16); // a1 = b - c src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4674: > 4672: > 4673: void dilithium_montmul_sub_add16() { > 4674: __ sqdmulh(v24, __ T4S, v1, v16); __ mulv(v16, __ T4S, v16, v30); // m = aLow * qinv ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967809436 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967809840 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967811299 From epeter at openjdk.org Mon Feb 24 15:30:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 15:30:07 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Mon, 24 Feb 2025 12:52:42 GMT, Roland Westrelin wrote: >>> @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. >>> >>> In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. 
At least that is what I remember from 4+ weeks ago. >> >> Do you understand when that happens? It doesn't feel right that the pre loop can be lost. > >> @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. > > Yes, if not too much work. @rwestrel I think we should just file an RFE to keep track of these assertions we would like to add once those issues are fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678803600 From adinn at openjdk.org Mon Feb 24 15:33:08 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 15:33:08 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4683: > 4681: __ mulv(v19, __ T4S, v7, v19); > 4682: > 4683: __ mulv(v16, __ T4S, v16, v30); __ mulv(v16, __ T4S, v16, v30); // m = aLow * qinv src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4688: > 4686: __ mulv(v19, __ T4S, v19, v30); > 4687: > 4688: __ sqdmulh(v16, __ T4S, v16, v31); __ sqdmulh(v16, __ T4S, v16, v31); // n = hi32(2 * m * q) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4693: > 4691: __ sqdmulh(v19, __ T4S, v19, v31); > 4692: > 4693: __ shsubv(v16, __ T4S, v24, v16); __ shsubv(v16, __ T4S, v24, v16); // a = (aHigh - n) / 2 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4698: > 4696: __ shsubv(v19, __ T4S, v27, v19); > 4697: > 4698: __ subv(v1, __ T4S, v0, v16); __ subv(v1, __ T4S, v0, v16); // x1 = x - a src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4703: > 4701: __ subv(v7, __ T4S, v6, v19); > 4702: > 4703: __ addv(v0, __ T4S, v0, v16); __ addv(v0, __ T4S, v0, v16); // x0 = x + a src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4742: > 4740: > 4741: for (int i = 0; i < 4; i++) { > 4742: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4813: > 4811: // level 5 > 4812: for (int i = 0; i < 1024; i += 256) { > 4813: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4853: > 4851: // level 6 > 4852: for (int i = 0; i < 1024; i += 128) { > 4853: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q 
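[Editorial note: the Montgomery-multiplication annotations suggested above can be cross-checked against a scalar reference model. The following C++ sketch is an illustration, not code from the patch (the function names are invented here), assuming the standard ML-DSA constants q = 8380417 and qinv = q^-1 mod 2^32 = 58728449; each 32-bit vector lane performs this computation.]

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane of the annotated Montgomery multiply:
// computes a with a * 2^32 == b * c (mod Q).
constexpr int32_t Q    = 8380417;   // ML-DSA modulus
constexpr int32_t QINV = 58728449;  // Q^-1 mod 2^32, so Q * QINV == 1 (mod 2^32)

// hi32(2 * x * y): what sqdmulh computes per lane (ignoring the saturating
// case x == y == INT32_MIN, which cannot arise for reduced inputs).
static int32_t sqdmulh(int32_t x, int32_t y) {
  return (int32_t)(((int64_t)x * y * 2) >> 32);
}

// (x - y) / 2 with a widened intermediate: what shsubv computes per lane.
static int32_t shsub(int32_t x, int32_t y) {
  return (int32_t)(((int64_t)x - y) >> 1);
}

static int32_t mont_mul(int32_t b, int32_t c) {
  int32_t aHigh = sqdmulh(b, c);                    // aHigh = hi32(2 * b * c)
  int32_t aLow  = (int32_t)((int64_t)b * c);        // aLow = lo32(b * c)
  int32_t m     = (int32_t)((int64_t)aLow * QINV);  // m = lo32(aLow * qinv)
  int32_t n     = sqdmulh(m, Q);                    // n = hi32(2 * m * q)
  return shsub(aHigh, n);                           // a = (aHigh - n) / 2
}
```

[Since m * q == b * c (mod 2^32), the low halves cancel exactly and the result satisfies a * 2^32 == b * c (mod q), which is the postcondition stated in the suggested header comment.]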
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4876: > 4874: // level 7 > 4875: for (int i = 0; i < 1024; i += 128) { > 4876: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4905: > 4903: > 4904: void dilithium_sub_add_montmul16() { > 4905: __ subv(v20, __ T4S, v0, v1); __ subv(v20, __ T4S, v0, v1); // b = x0 - x1 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4910: > 4908: __ subv(v23, __ T4S, v6, v7); > 4909: > 4910: __ addv(v0, __ T4S, v0, v1); __ addv(v0, __ T4S, v0, v1); // a0 = x0 + x1 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4915: > 4913: __ addv(v6, __ T4S, v6, v7); > 4914: > 4915: __ sqdmulh(v24, __ T4S, v20, v16); __ sqdmulh(v24, __ T4S, v20, v16); // aHigh = hi32(2 * b * c) __ mulv(v1, __ T4S, v20, v16); // aLow = lo32(b * c) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4924: > 4922: __ mulv(v7, __ T4S, v23, v19); > 4923: > 4924: __ mulv(v1, __ T4S, v1, v30); __ mulv(v1, __ T4S, v1, v30); // m = aLow * qinv src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4929: > 4927: __ mulv(v7, __ T4S, v7, v30); > 4928: > 4929: __ sqdmulh(v1, __ T4S, v1, v31); __ sqdmulh(v1, __ T4S, v1, v31); // n = hi32(2 * m * q) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4934: > 4932: __ sqdmulh(v7, __ T4S, v7, v31); > 4933: > 4934: __ shsubv(v1, __ T4S, v24, v1); __ shsubv(v1, __ T4S, v24, v1); // a1 = (aHigh - n) / 2 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5044: > 5042: // level0 > 5043: for (int i = 0; i < 1024; i += 128) { > 5044: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5115: > 5113: __ str(v31, __ Q, Address(coeffs, i + 224)); > 5114: dilithium_load32zetas(zetas); > 5115: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q 
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5166: > 5164: __ lea(dilithiumConsts, ExternalAddress((address) StubRoutines::aarch64::_dilithiumConsts)); > 5165: > 5166: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q __ ldr(v29, __ Q, Address(dilithiumConsts, 48)); // rsquare src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5228: > 5226: __ lea(dilithiumConsts, ExternalAddress((address) StubRoutines::aarch64::_dilithiumConsts)); > 5227: > 5228: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967863821 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967864748 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967865658 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967866379 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967866822 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967867752 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967869143 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967870036 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967870373 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967871386 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967871949 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967872681 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967873281 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967873918 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967874418 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967875655 PR Review Comment: 
https://git.openjdk.org/jdk/pull/23300#discussion_r1967876745 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967877717 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967878884 From roland at openjdk.org Mon Feb 24 15:49:01 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 24 Feb 2025 15:49:01 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Mon, 24 Feb 2025 12:52:42 GMT, Roland Westrelin wrote: >>> @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. >>> >>> In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. >> >> Do you understand when that happens? It doesn't feel right that the pre loop can be lost. > >> @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. > > Yes, if not too much work. > @rwestrel I think we should just file an RFE to keep track of these assertions we would like to add once those issues are fixed. That sounds reasonable to me. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678873056 From coleenp at openjdk.org Mon Feb 24 15:59:58 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 15:59:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 05:12:38 GMT, David Holmes wrote: > Does the SA not need any updates in relation to this? No, the SA doesn't know about these compiler intrinsics. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2678913119 From coleenp at openjdk.org Mon Feb 24 15:59:59 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 15:59:59 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> <3qpqR3PC8PFmdgaIoSYA3jDWdl-oon0-AcIzXcI76rY=.38635503-c067-4f6e-a4f1-92c1b6d991d1@github.com> Message-ID: <4eQr952WCBhGqlLqX0q2TCDLuFrwh_UmxgJcb2BOs_s=.8e7f55a7-60ec-4cc8-9a8b-cca84ccbba10@github.com> On Thu, 20 Feb 2025 23:23:08 GMT, Coleen Phillimore wrote: >>> ... but not in the return since the caller likely will fetch the klass pointer next. >> >> I notice that too. Callers are using is_primitive() to short-circuit calls to as_Klass(), which means they seem to be aware of this implementation detail when maybe they shouldn't. > > There are 70 callers so yes, it might be something that shouldn't be known in this many places. Definitely out of the scope of this PR. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1967943222 From adinn at openjdk.org Mon Feb 24 16:21:58 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 16:21:58 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: <6B25PDNMw8dDUm8r5rX4heL3cfvbsPVKqnVg7e1Ax84=.43b91704-15fa-4445-b8be-216fffcf12d4@github.com> On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions Please add comments as indicated to relate generated code to original Java source. Otherwise good to go. ------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23300#pullrequestreview-2637711807 From adinn at openjdk.org Mon Feb 24 16:21:59 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 16:21:59 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: <36J5kPTCknNCBjMx56e9JmLK2vFbvxBXXXOvTmv5pDs=.6aaa25e2-4cd9-4217-8da3-3280c1d3c4db@github.com> On Fri, 21 Feb 2025 10:23:37 GMT, Ferenc Rakoczi wrote: >> Hi. Here is the test result of our CI. >> >> ### copyright year >> >> the following files should update the copyright year to 2025. 
>> >> >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp >> src/hotspot/share/runtime/globals.hpp >> src/java.base/share/classes/sun/security/provider/ML_DSA.java >> src/java.base/share/classes/sun/security/provider/SHA3Parallel.java >> test/micro/org/openjdk/bench/java/security/MLDSA.java >> >> >> ### cross-build failure >> >> Cross build for riscv64/s390/ppc64 failed. >> >> Here shows the error msg for ppc64 >> >> >> === Output from failing command(s) repeated here === >> * For target support_interim-jmods_support__create_java.base.jmod_exec: >> # >> # A fatal error has been detected by the Java Runtime Environment: >> # >> # Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769 >> # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0 >> # >> # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3) >> # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) >> # Problematic frame: >> # V [libjvm.so+0x3b391c] Instruction_aarch64::~Instruction_aarch64()+0xbc >> # >> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/ >> # >> # An error report file with more information is saved as: >> # /tmp/jdk-src/make/hs_err_pid72752.log >> ... (rest of output omitted) >> >> * All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs. 
>> === End of repeated output === >> >> I suppose we should make a similar update to `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` for the other platforms > > @shqking, I changed the copyright years, but I don't really understand how the aarch64-specific code can overflow buffers on other architectures. As far as I understand, Instruction_aarch64 should not have been there in a ppc build. > Was this a build attempted on an aarch64 host for the other architectures? @ferakocz I have indicated a few places where I think you should add comments to clarify the relationship to the original Java code or just clarify what data is being used. I think the code is ok to go in as it is but I would really like to investigate a better structuring of the generator code. This can be done as a follow-up rather than delay getting this version committed. There are two things I still see as problematic with the current code. 1) There are lots of places in your auxiliary generator methods and also in their client methods where you generate distinct sequences of calls to the assembler sharing essentially the same code shape i.e. the same instructions but with different vector register arguments. For example, in `dilithium_montmul32` you generate the multiply sequence to Montgomery multiply 4x4s registers in v0..v3 by 4x4s registers in v16..v19 and then repeat exactly the same code in exactly the same sequence to multiply the 4x4s registers in v4..v7 by 4x4s registers in v20..v23. Likewise, `dilithium_sub_add_montmul16` generates that same shape of code but uses the montmul sequence with odd registers v1..v7 paired against the compact sequence v16..v19. As another example, you generate various 4- or 8-long sequences of subv and addv operations at various points, including in some of the top-level methods. I appreciate that you have folded one of the montmul cases into the other by adding the `bool by_constant` parameter to `dilithium_montmul32`. 
However, I think it would be worth investigating an alternative that would allow more, and more systematic, use of auxiliary methods. 2) Your current auxiliary generator methods rely on a fixed mapping of input, output and scratch registers to specific registers. This is part of the reason why you cannot always call your auxiliaries (or smaller pieces of them) from other locations where the same code shape is generated -- the input and output mappings of data to registers expected by the auxiliary do not match the register sequences in which the relevant data are (transiently) located. This same fact also means that the repeated code sections heavily depend on naming exactly the right register on each generator line. That makes it harder for a maintainer to recognize that what is really just one common, abstract operation is, at each occurrence, consuming, combining and updating several input sequences of related registers to generate one or more output sequences. That also means that it would be very easy to introduce an error if the code ever needed to be changed. I would like to investigate an alternative approach where your auxiliary generator methods and their callers pass arguments that identify the vector register sequences to be consumed as inputs, used as temporaries and written as outputs. In cases where the routines operate on sequences of 4 or 8 successive vectors, that would, at the very least, involve specifying the first register for each input, temporary or output, e.g. for the montmul32 multiply v0+ by v16+ using v24+ as temporaries and v30+ as constants and output the results to v16+. However, that leaves it implicit that the first two inputs involve 8 registers while the temporaries involve 4 and the constants 2. The more general requirement is not just to specify the vector sequence length (2, 4 or 8) but also to allow the default stride of one (e.g. 
v0, v2, ...) or constant sequences (v28, v28, ... as would be needed for multiplying by a constant). I have prototyped a simple vector sequence type `VRSeq` that models an indexable sequence of FloatRegisters and allows many of your higher level routines to simply declare register sets they operate on and then pass them as arguments to a range of simple auxiliary generator functions that can be used in many places where you currently have a lot of inline calls to the assembler -- see attachment: [vseq.zip](https://github.com/user-attachments/files/18946470/vseq.zip) I'll raise a JIRA to cover recoding the current implementation using this type and post a follow-up PR that uses it to see how far it helps simplify the code. I believe it will make it easier for maintainers to understand the structure of the generated code and observe/verify the use of registers to store specific values. It should also allow assertions about the use of registers to be added to the code to ensure that values are not being overwritten (except in circumstances where that is legitimate). Meanwhile I'll approve this PR modulo the commenting I suggested. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2678977770 From adinn at openjdk.org Mon Feb 24 16:33:54 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 16:33:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
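[Editorial note: to make the `VRSeq` idea mentioned above concrete, here is a hypothetical sketch of such a type. The names, fields and signatures are guesses for illustration only -- the actual prototype is in the attached vseq.zip. A register sequence is described by a base register, a length and a stride, where a stride of 0 yields a constant sequence.]

```cpp
#include <cassert>

// Illustrative model of an indexable vector-register sequence. In HotSpot
// the element type would be FloatRegister; plain ints stand in for the
// register numbers here.
struct VRSeq {
  int _base;    // first register number, e.g. 16 for v16
  int _stride;  // 1 for v16,v17,...; 2 for v0,v2,...; 0 for v28,v28,...
  int _length;  // number of registers in the sequence

  VRSeq(int base, int length, int stride = 1)
    : _base(base), _stride(stride), _length(length) {}

  int operator[](int i) const {
    assert(i >= 0 && i < _length);
    return _base + i * _stride;  // yields the i-th register of the sequence
  }
};

// A generator helper could then be written once against sequences instead of
// hard-coded register names, e.g. (pseudo-signature):
//   void montmul(VRSeq out, VRSeq in1, VRSeq in2, VRSeq tmp, VRSeq consts);
```

[A helper written this way could cover both the v0..v7 by v16..v23 case and the odd-register case in `dilithium_sub_add_montmul16` simply by passing different sequences.]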
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions I raised [JDK-8350589](https://bugs.openjdk.org/browse/JDK-8350589) to cover investigation of an alternative implementation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2679012108 From aph at openjdk.org Mon Feb 24 17:09:54 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 24 Feb 2025 17:09:54 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 12:09:57 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. src/hotspot/cpu/aarch64/aarch64.ad line 17275: > 17273: > 17274: // This pattern would result in the following instructions (the first two are for ConvF2HF > 17275: // and the last instruction is for ReinterpretS2HF) - Suggestion: // Without this pattern, (ReinterpretS2HF (ConvF2HF src)) would result in the following instructions (the first two for ConvF2HF // and the last instruction for ReinterpretS2HF) - Reads a little better, I think? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1968070079 From adinn at openjdk.org Mon Feb 24 17:15:59 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 17:15:59 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions Marked as reviewed by adinn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23300#pullrequestreview-2637878768 From adinn at openjdk.org Mon Feb 24 17:16:00 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 17:16:00 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Thu, 20 Feb 2025 17:22:25 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2618: >> >>> 2616: INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >>> 2617: INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >>> 2618: INSN(sqdmulh,0, 0b101101, false); // accepted arrangements: T4H, T8H, T2S, T4S >> >> Hi, not a comment on the algorithm itself but you might have to add these new instructions in the gtest for aarch64 here - test/hotspot/gtest/aarch64/aarch64-asmtest.py and use this file to generate test/hotspot/gtest/aarch64/asmtest.out.h which would contain these newly added instructions. > > I have tried that, but the python script (actually the as command that it started) threw error messages: > > aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, 32760]. 
> prfm PLDL1KEEP, [x15, 43] > ^ > aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > ^ > aarch64ops.s:359:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > add x11, x21, x5, uxtb #3 > ^ > aarch64ops.s:360:22: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > adds x11, x17, x17, uxtw #1 > ^ > aarch64ops.s:361:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x11, x0, x15, uxtb #1 > ^ > aarch64ops.s:362:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > subs x7, x1, x0, sxth #2 > ^ > This is without any modifications from what is in the master branch currently. @ferakocz This also really needs addressing before committing the patch. Perhaps @theRealAph can advise on how to circumvent the problems you found when trying to update the python script? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1968076559 From aph at openjdk.org Mon Feb 24 17:31:52 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 24 Feb 2025 17:31:52 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 12:09:57 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. src/hotspot/cpu/aarch64/aarch64.ad line 6978: > 6976: // ldr instruction has 32/64/128 bit variants but not a 16-bit variant. This > 6977: // loads the 16-bit value from constant pool into a 32-bit register but only > 6978: // the bottom half will be populated. Surely what actually happens here is that it loads a 32-bit word from the constant pool. The bottom 16 bits of this word contain the half-precision constant, the top 16 bits are zero. 
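The point about the 32-bit constant-pool load can be illustrated in miniature. Below is a hedged sketch (not JDK code; the helper name `hf_constant_word` is invented for illustration) modeling the constant-pool slot as a 32-bit word whose low 16 bits hold the IEEE 754 half-precision encoding, so the 32-bit `ldr` variant fetches the constant with the top 16 bits zero:

```python
import struct

def hf_constant_word(value):
    # Model of the constant-pool slot described above: a 32-bit word whose
    # low 16 bits are the IEEE 754 binary16 encoding of `value` and whose
    # top 16 bits are zero -- what the 32-bit ldr variant would fetch.
    (half_bits,) = struct.unpack("<H", struct.pack("<e", value))
    return half_bits  # zero-extension to 32 bits is implicit

word = hf_constant_word(1.5)
# fp16 1.5 encodes as 0x3E00; the top half of the loaded word stays zero.
```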
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1968101418 From bkilambi at openjdk.org Mon Feb 24 17:44:52 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 24 Feb 2025 17:44:52 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 17:28:43 GMT, Andrew Haley wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. > > src/hotspot/cpu/aarch64/aarch64.ad line 6978: > >> 6976: // ldr instruction has 32/64/128 bit variants but not a 16-bit variant. This >> 6977: // loads the 16-bit value from constant pool into a 32-bit register but only >> 6978: // the bottom half will be populated. > > Surely what actually happens here is that it loads a 32-bit word from the constant pool. The bottom 16 bits of this word contain the half-precision constant, the top 16 bits are zero. I agree. The wording didn't quite convey that. I will change it in my next PS. Thank you for looking into the patch! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1968120239 From liach at openjdk.org Mon Feb 24 17:52:00 2025 From: liach at openjdk.org (Chen Liang) Date: Mon, 24 Feb 2025 17:52:00 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: On Sat, 22 Feb 2025 14:49:38 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Use modifiers field directly in isInterface. 
The limited changes to the Java codebase look reasonable. We should probably get a double check from Alan or some other architect. ------------- Marked as reviewed by liach (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2637961573 From rriggs at openjdk.org Mon Feb 24 19:10:02 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Mon, 24 Feb 2025 19:10:02 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: On Sat, 22 Feb 2025 14:49:38 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Use modifiers field directly in isInterface. A nice simplification. src/java.base/share/classes/java/lang/Class.java line 241: > 239: private Class(ClassLoader loader, Class arrayComponentType, char mods, ProtectionDomain pd, boolean isPrim) { > 240: // Initialize final field for classLoader. The initialization value of non-null > 241: // prevents future JIT optimizations from assuming this final field is null. To add a bit more depth to this comment, I'd add: "The following assignments are done directly by the VM without calling this constructor." Or something to that effect. ------------- Marked as reviewed by rriggs (Reviewer).
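As a rough model of the change under review (a toy, not the actual JDK code; `ACC_INTERFACE = 0x0200` is the standard JVM access flag, but the class and field names here are stand-ins), the three predicates become plain field checks rather than native calls:

```python
ACC_INTERFACE = 0x0200  # standard JVM access flag marking interfaces

class MirrorModel:
    # Toy stand-in for java.lang.Class: in the real change the VM fills in
    # these fields directly when it creates the mirror object.
    def __init__(self, modifiers, component_type=None, is_primitive=False):
        self.modifiers = modifiers
        self.component_type = component_type   # non-null only for arrays
        self.is_primitive_flag = is_primitive  # the new final transient boolean

    def is_interface(self):
        # check the modifier flags instead of calling into the VM
        return (self.modifiers & ACC_INTERFACE) != 0

    def is_array(self):
        # check whether the component mirror is non-null
        return self.component_type is not None

    def is_primitive(self):
        # read the VM-initialized boolean
        return self.is_primitive_flag
```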
PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2638174546 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1968254793 From coleenp at openjdk.org Mon Feb 24 19:30:41 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 19:30:41 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v8] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Add a comment about Class constructor. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/db7c9782..591abdda Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=06-07 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Mon Feb 24 19:30:41 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 19:30:41 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 19:06:30 GMT, Roger Riggs wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Use modifiers field directly in isInterface. 
> > src/java.base/share/classes/java/lang/Class.java line 241: > >> 239: private Class(ClassLoader loader, Class arrayComponentType, char mods, ProtectionDomain pd, boolean isPrim) { >> 240: // Initialize final field for classLoader. The initialization value of non-null >> 241: // prevents future JIT optimizations from assuming this final field is null. > > To add a bit more depth to this comment, I'd add. > > "The following assignments are done directly by the VM without calling this constructor." > Or something to that effect. Okay, that's a good comment. I'll add it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1968297499 From coleenp at openjdk.org Mon Feb 24 19:30:41 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 19:30:41 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: <5i_vwoj0oivW08tMAX5Bp2m7yK_pgQOy0b7_MizQ-uM=.0f54046e-8972-4d05-89d6-aee42b079b48@github.com> On Sat, 22 Feb 2025 14:49:38 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Use modifiers field directly in isInterface. Thanks for reviewing Roger. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2679447427 From dlong at openjdk.org Mon Feb 24 21:09:57 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 24 Feb 2025 21:09:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v8] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 19:30:41 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Add a comment about Class constructor. Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2638441924 From kvn at openjdk.org Tue Feb 25 00:37:00 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Feb 2025 00:37:00 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> On Mon, 24 Feb 2025 08:00:24 GMT, Emanuel Peter wrote: > But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. Okay, we are back to our previous conversation - we will wait for your aliasing-analysis runtime-checks implementation and do performance runs to see if the "slow" path affects performance. Okay. PS: A "slow" path implies that it is not taken frequently and should not affect the general performance of the application.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2680031423 From epeter at openjdk.org Tue Feb 25 07:11:55 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 07:11:55 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: On Tue, 25 Feb 2025 00:34:14 GMT, Vladimir Kozlov wrote: > > But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. > > Okay, we are back to our previous conversation - we will wait your aliasing-analysis runtime-checks implementation and do performance runs to see if "slow" path affects performance. > > Okay. Sounds good, we will revisit and write more benchmarks there. > > PS: "slow" path implies that it is not taking frequently and it should not affect general performance of application. For me "slow" just means less optimized, because some assumption does not hold. The "fast" path is faster, because it has more assumptions and can optimize more (i.e. vectorize in this case, or vectorize more instructions). Do you have a better name than "fast/slow"? 
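The fast/slow multiversioning shape being discussed can be sketched abstractly (hypothetical names and widths; the real C2 transformation emits two copies of the loop guarded by a `multiversion_if`): both versions compute the same result, but the fast one runs under a speculative alignment assumption that is checked once at runtime.

```python
VECTOR_BYTES = 16  # hypothetical vector width the fast path assumes

def increment_ints(buf, base_address):
    # multiversion_if: check the speculative assumption once, at runtime.
    if base_address % VECTOR_BYTES == 0:
        # fast_loop: alignment may be assumed, so this copy could be
        # vectorized (modeled here by processing in vector-sized chunks).
        step = VECTOR_BYTES // 4  # 4-byte ints per vector
        for i in range(0, len(buf), step):
            for j in range(i, min(i + step, len(buf))):
                buf[j] += 1
    else:
        # slow_loop: no alignment assumption, so no vectorization -- but it
        # is still compiled and can still be unrolled/optimized otherwise.
        for i in range(len(buf)):
            buf[i] += 1
    return buf
```

Whatever the two paths end up being called, the key property is that they are semantically identical and differ only in which optimizations their assumptions permit.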
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2680885496 From epeter at openjdk.org Tue Feb 25 07:15:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 07:15:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: On Tue, 25 Feb 2025 00:34:14 GMT, Vladimir Kozlov wrote: >> @vnkozlov I mean the issue this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree? > >> But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. > > Okay, we are back to our previous conversation - we will wait your aliasing-analysis runtime-checks implementation and do performance runs to see if "slow" path affects performance. > > Okay. > > PS: "slow" path implies that it is not taking frequently and it should not affect general performance of application. @vnkozlov @rwestrel Let me summarize the tasks left to do here: - Rename `stalled` -> `delayed`. And `unstall` -> `resume_optimizations` or alike. Improve some comments. - File follow-up RFE for more verification (must find multiversion-if from multiversioned loop) - currently blocked by predicate traversal issue. 
Maybe we can also assert that we can always find the pre-loop from the main-loop, at least during loop-opts. - When working on aliasing-analysis runtime-check, we have to do more performance analysis, and show the need of both the fast and slow path loops. Let me know if there is more ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2680894298 From epeter at openjdk.org Tue Feb 25 09:27:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 09:27:13 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
> > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits: - Merge branch 'master' into JDK-8323582-SW-native-alignment - stall -> delay, plus some more comments - adjust selector if probability - Merge branch 'master' into JDK-8323582-SW-native-alignment - remove multiversion mark if we break the structure - register opaque with igvn - copyright and rm CFG check - IR rules for all cases - 3 test versions - test changed to unaligned ints - ... 
and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 ------------- Changes: https://git.openjdk.org/jdk/pull/22016/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=03 Stats: 1089 lines in 27 files changed: 966 ins; 28 del; 95 mod Patch: https://git.openjdk.org/jdk/pull/22016.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016 PR: https://git.openjdk.org/jdk/pull/22016 From epeter at openjdk.org Tue Feb 25 09:36:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 09:36:58 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: On Tue, 25 Feb 2025 00:34:14 GMT, Vladimir Kozlov wrote: >> @vnkozlov I mean the issue this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree? > >> But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. > > Okay, we are back to our previous conversation - we will wait your aliasing-analysis runtime-checks implementation and do performance runs to see if "slow" path affects performance. > > Okay. > > PS: "slow" path implies that it is not taking frequently and it should not affect general performance of application. 
@vnkozlov @rwestrel - I did the `stall` -> `delay` renaming, and added some more comments in places you asked for it. Let me know if that looks better. - Filed: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if - I added a comment to [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751) C2 SuperWord: Aliasing Analysis runtime check, to check performance around slow_loop. Let me know what more I can do ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2681315131 From aph at openjdk.org Tue Feb 25 09:40:57 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 09:40:57 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Mon, 24 Feb 2025 17:11:24 GMT, Andrew Dinn wrote: >> I have tried that, but the python script (actually the as command that it started) threw error messages: >> >> aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, 32760]. >> prfm PLDL1KEEP, [x15, 43] >> ^ >> aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> sub x1, x10, x23, sxth #2 >> ^ >> aarch64ops.s:359:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> add x11, x21, x5, uxtb #3 >> ^ >> aarch64ops.s:360:22: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> adds x11, x17, x17, uxtw #1 >> ^ >> aarch64ops.s:361:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> sub x11, x0, x15, uxtb #1 >> ^ >> aarch64ops.s:362:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> subs x7, x1, x0, sxth #2 >> ^ >> This is without any modifications from what is in the master branch currently. 
> @ferakocz This also really needs addressing before committing the patch. Perhaps @theRealAph can advise on how to circumvent the problems you found when trying to update the python script? > You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like - `python aarch64-asmtest.py | expand > asmtest.out.h`. Please let me know if you still face problems. People have been running this script for a decade now. Let's look at just one of these: aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] sub x1, x10, x23, sxth #2 From the AArch64 manual: SUB (extended register) SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} It thinks this is a SUB (shifted register), but it's really a SUB (extended register). fedora:aarch64 $ cat t.s sub x1, x10, x23, sxth #2 fedora:aarch64 $ as t.s fedora:aarch64 $ objdump -D a.out Disassembly of section .text: 0000000000000000 <.text>: 0: cb37a941 sub x1, x10, w23, sxth #2 So perhaps binutils expects w23 here, not x23. But the manual (ARM DDI 0487K.a) says x23 should be just fine, and, what's more, gives the x form preferred status. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969374124 From duke at openjdk.org Tue Feb 25 11:17:58 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 25 Feb 2025 11:17:58 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 25 Feb 2025 09:36:49 GMT, Andrew Haley wrote: >> @ferakocz This also really needs addressing before committing the patch. Perhaps @theRealAph can advise on how to circumvent the problems you found when trying to update the python script?
> >> You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like - `python aarch64-asmtest.py | expand > asmtest.out.h`. Please let me know if you still face problems. > > People have been running this script for a decade now. > > Let's look at just one of these: > > > aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > > > From the AArch64 manual: > > SUB (extended register) > SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} > > It thinks this is a SUB (shifted register), but it's really a SUB (extended register). > > > fedora:aarch64 $ cat t.s > sub x1, x10, x23, sxth #2 > fedora:aarch64 $ as t.s > fedora:aarch64 $ objdump -D a.out > Disassembly of section .text: > > 0000000000000000 <.text>: > 0: cb37a941 sub x1, x10, w23, sxth #2 > > > So perhaps binutils expects w23 here, not x23. But the manual (ARM DDI 0487K.a) says x23 should be just fine, and, what's more, gives the x form preferred status. @theRealAph, maybe we are not reading the same manual (ARM DDI 0487K.a).
In my copy, SUB (extended register) is defined as SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} and <R> should be W when <extend> is SXTH, and the as I have enforces this: ferakocz at ferakocz-mac aarch64 % cat t.s sub x1, x10, w23, sxth #2 ferakocz at ferakocz-mac aarch64 % cat > t1.s sub x1, x10, x23, sxth #2 ferakocz at ferakocz-mac aarch64 % cat t.s sub x1, x10, w23, sxth #2 ferakocz at ferakocz-mac aarch64 % cat t1.s sub x1, x10, x23, sxth #2 ferakocz at ferakocz-mac aarch64 % as --version Apple clang version 16.0.0 (clang-1600.0.26.6) Target: arm64-apple-darwin24.3.0 Thread model: posix InstalledDir: /Library/Developer/CommandLineTools/usr/bin ferakocz at ferakocz-mac aarch64 % as t.s ferakocz at ferakocz-mac aarch64 % objdump -D t.o t.o: file format mach-o arm64 Disassembly of section __TEXT,__text: 0000000000000000 : 0: cb37a941 sub x1, x10, w23, sxth #2 ferakocz at ferakocz-mac aarch64 % as t1.s t1.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] sub x1, x10, x23, sxth #2 ^ I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I haven't read through all of the 14568 pages. So I'm stuck for now. What 'as' are you using? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969561791 From coleenp at openjdk.org Tue Feb 25 12:40:03 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 25 Feb 2025 12:40:03 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v8] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 19:30:41 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests.
> > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Add a comment about Class constructor. Thanks for reviewing Dean, Roger, Vladimir, Yudi and Chen, and comments David. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2681823548 From coleenp at openjdk.org Tue Feb 25 12:40:04 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 25 Feb 2025 12:40:04 GMT Subject: Integrated: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. This pull request has now been integrated. Changeset: c413549e Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/c413549eb775f4209416c718dc9aa0748144a6b4 Stats: 202 lines in 20 files changed: 43 ins; 128 del; 31 mod 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native Reviewed-by: dlong, rriggs, vlivanov, yzheng, liach ------------- PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Tue Feb 25 13:19:11 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 25 Feb 2025 13:19:11 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. 
> > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list.
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. > > However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 332: > 330: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ > 331: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ > 332: volatile_nonstatic_field(ObjectMonitor, _EntryListTail, ObjectWaiter*) \ You may need to coordinate with @mur47x111 to see what graal does with this field. I suspect the graal code also checks both cxq and EntryList in the unlock fast path and now only needs to check _EntryList. In which case we don't need to export EntryListTail. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1947058523 From fbredberg at openjdk.org Tue Feb 25 13:19:10 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Tue, 25 Feb 2025 13:19:10 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists Message-ID: I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`.
When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation.
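[Editor's note] The push-to-head/walk-to-tail scheme described above can be sketched in a few lines of standard C++. This is a hypothetical, simplified model for illustration only: `std::atomic` stands in for HotSpot's Atomic class, the names (`entry_list`, `entry_list_tail`) mirror the PR text, and none of this is the actual HotSpot code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical model of the entry_list scheme described in the PR text.
struct Waiter {
  Waiter* next = nullptr;   // set once at push; the interior stays stable
  Waiter* prev = nullptr;   // assigned lazily while walking to the tail
  int id;
  explicit Waiter(int i) : id(i) {}
};

struct Monitor {
  std::atomic<Waiter*> entry_list{nullptr};  // head: most recently pushed
  Waiter* entry_list_tail = nullptr;         // cached tail: FIFO successor

  // Contending threads push themselves onto the head with CAS.
  void push(Waiter* w) {
    Waiter* head = entry_list.load(std::memory_order_relaxed);
    do {
      w->next = head;  // link through next only: singly linked at first
    } while (!entry_list.compare_exchange_weak(head, w,
                                               std::memory_order_release,
                                               std::memory_order_relaxed));
  }

  // The exiting thread chooses the successor in FIFO order: walk from the
  // head, assigning prev pointers as it goes, and cache the tail.
  Waiter* choose_successor() {
    if (entry_list_tail != nullptr) return entry_list_tail;
    Waiter* w = entry_list.load(std::memory_order_acquire);
    if (w == nullptr) return nullptr;
    while (w->next != nullptr) {
      w->next->prev = w;    // doubly link the walked portion
      w = w->next;
    }
    entry_list_tail = w;    // first-pushed waiter == FIFO successor
    return w;
  }
};
```

This mirrors why the head is the only contended word: pushers touch only the head CAS, while the tail walk happens on the exit path and its result is cached.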
However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check both `EntryList` and `cxq` makes this PR worthwhile, I think. Tests tier1-7 pass okay, as do micro-benchmarks like `vm.lang.LockUnlock`. Unsupported platforms { ppc, riscv, s390 } have been tested with QEMU. ------------- Commit messages: - Moved set_bad_pointers() and added accessors. - Merge branch 'master' into 8343840_rewrite_objectmonitor_lists - Atomic hygiene - Fixed a bug in UnlinkAfterAcquire - General cleanup - Updated theory of operations comment - 8343840: Rewrite the ObjectMonitor lists Changes: https://git.openjdk.org/jdk/pull/23421/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8343840 Stats: 594 lines in 9 files changed: 213 ins; 219 del; 162 mod Patch: https://git.openjdk.org/jdk/pull/23421.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23421/head:pull/23421 PR: https://git.openjdk.org/jdk/pull/23421 From aph at openjdk.org Tue Feb 25 13:19:02 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:19:02 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 25 Feb 2025 11:15:39 GMT, Ferenc Rakoczi wrote: >>> You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like - `python aarch64-asmtest.py | expand > asmtest.out.h`. Please let me know if you still face problems. >> >> People have been running this script for a decade now.
>> >> Let's look at just one of these: >> >> >> aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> sub x1, x10, x23, sxth #2 >> >> >> From the AArch64 manual: >> >> SUB (extended register) >> SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} >> >> It thinks this is a SUB (shifted register), but it's really a SUB (extended register). >> >> >> fedora:aarch64 $ cat t.s >> sub x1, x10, x23, sxth #2 >> fedora:aarch64 $ as t.s >> fedora:aarch64 $ objdump -D a.out >> Disassembly of section .text: >> >> 0000000000000000 <.text>: >> 0: cb37a941 sub x1, x10, w23, sxth #2 >> >> >> So perhaps binutils expects w23 here, not x23. But the manual (ARM DDI 0487K.a) says x23 should be just fine, and, what's more, gives the x form preferred status. > > @theRealAlph, maybe we are not reading the same manual (ARM DDI 0487K.a). In my copy: > SUB (extended register) is defined as > SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} > and <R> should be W when <extend> is SXTH > and the as I have enforces this: > > ferakocz at ferakocz-mac aarch64 % cat t.s > sub x1, x10, w23, sxth #2 > ferakocz at ferakocz-mac aarch64 % cat > t1.s > sub x1, x10, x23, sxth #2 > ferakocz at ferakocz-mac aarch64 % cat t.s > sub x1, x10, w23, sxth #2 > ferakocz at ferakocz-mac aarch64 % cat t1.s > sub x1, x10, x23, sxth #2 > ferakocz at ferakocz-mac aarch64 % as --version > Apple clang version 16.0.0 (clang-1600.0.26.6) > Target: arm64-apple-darwin24.3.0 > Thread model: posix > InstalledDir: /Library/Developer/CommandLineTools/usr/bin > ferakocz at ferakocz-mac aarch64 % as t.s > ferakocz at ferakocz-mac aarch64 % objdump -D t.o > > t.o: file format mach-o arm64 > > Disassembly of section __TEXT,__text: > > 0000000000000000 : > 0: cb37a941 sub x1, x10, w23, sxth #2 > ferakocz at ferakocz-mac aarch64 % as t1.s > t1.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > ^ > > I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I
haven't read through all of the 14568 pages. > > So I'm stuck for now. What 'as' are you using? > I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I > haven't read through all of the 14568 pages. Yes, you've got a point, but it's always worked. Is this a macos thing, maybe? > So I'm stuck for now. What 'as' are you using? Latest binutils, today. I checked it out half an hour ago. GNU assembler (GNU Binutils) 2.44.50.20250225 Copyright (C) 2025 Free Software Foundation, Inc. Try this: diff --git a/test/hotspot/gtest/aarch64/aarch64-asmtest.py b/test/hotspot/gtest/aarch64/aarch64-asmtest.py index 9c770632e25..b1674fff04d 100644 --- a/test/hotspot/gtest/aarch64/aarch64-asmtest.py +++ b/test/hotspot/gtest/aarch64/aarch64-asmtest.py @@ -476,8 +476,13 @@ class AddSubExtendedOp(ThreeRegInstruction): + ", " + str(self.amount) + ");")) def astr(self): - return (super(AddSubExtendedOp, self).astr() - + (", " + AddSubExtendedOp.optNames[self.option] + prefix = self.asmRegPrefix + return (super(ThreeRegInstruction, self).astr() + + ('%s, %s, %s' + % (self.reg[0].astr(prefix), + self.reg[1].astr(prefix), + self.reg[1].astr("w")) + + ", " + AddSubExtendedOp.optNames[self.option] + " #" + str(self.amount))) class AddSubImmOp(TwoRegImmedInstruction): ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969760509 From aph at openjdk.org Tue Feb 25 13:19:03 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:19:03 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 25 Feb 2025 13:14:52 GMT, Andrew Haley wrote: >> @theRealAlph, maybe we are not reading the same manual (ARM DDI 0487K.a). 
In my copy: >> SUB (extended register) is defined as >> SUB , , {, {#}} >> and should be W when is SXTH >> and the as I have enforces this: >> >> ferakocz at ferakocz-mac aarch64 % cat t.s >> sub x1, x10, w23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % cat > t1.s >> sub x1, x10, x23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % cat t.s >> sub x1, x10, w23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % cat t1.s >> sub x1, x10, x23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % as --version >> Apple clang version 16.0.0 (clang-1600.0.26.6) >> Target: arm64-apple-darwin24.3.0 >> Thread model: posix >> InstalledDir: /Library/Developer/CommandLineTools/usr/bin >> ferakocz at ferakocz-mac aarch64 % as t.s >> ferakocz at ferakocz-mac aarch64 % objdump -D t.o >> >> t.o: file format mach-o arm64 >> >> Disassembly of section __TEXT,__text: >> >> 0000000000000000 : >> 0: cb37a941 sub x1, x10, w23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % as t1.s >> t1.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> sub x1, x10, x23, sxth #2 >> ^ >> >> I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I haven't read through all of the 14568 pages. >> >> So I'm stuck for now. What 'as' are you using? > >> I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I > haven't read through all of the 14568 pages. > > Yes, you've got a point, but it's always worked. Is this a macos thing, maybe? > >> So I'm stuck for now. What 'as' are you using? > > Latest binutils, today. I checked it out half an hour ago. > > GNU assembler (GNU Binutils) 2.44.50.20250225 > Copyright (C) 2025 Free Software Foundation, Inc. 
> > Try this: > > > diff --git a/test/hotspot/gtest/aarch64/aarch64-asmtest.py b/test/hotspot/gtest/aarch64/aarch64-asmtest.py > index 9c770632e25..b1674fff04d 100644 > --- a/test/hotspot/gtest/aarch64/aarch64-asmtest.py > +++ b/test/hotspot/gtest/aarch64/aarch64-asmtest.py > @@ -476,8 +476,13 @@ class AddSubExtendedOp(ThreeRegInstruction): > + ", " + str(self.amount) + ");")) > > def astr(self): > - return (super(AddSubExtendedOp, self).astr() > - + (", " + AddSubExtendedOp.optNames[self.option] > + prefix = self.asmRegPrefix > + return (super(ThreeRegInstruction, self).astr() > + + ('%s, %s, %s' > + % (self.reg[0].astr(prefix), > + self.reg[1].astr(prefix), > + self.reg[1].astr("w")) > + + ", " + AddSubExtendedOp.optNames[self.option] > + " #" + str(self.amount))) > > class AddSubImmOp(TwoRegImmedInstruction): I just tried it with top-of trunk latest binutils: fedora:aarch64 $ ~/binutils-gdb-install/bin/as -march=armv9-a+sha3+sve2-bitperm aarch64ops.s fedora:aarch64 $ ~/binutils-gdb-install/bin/as --version GNU assembler (GNU Binutils) 2.44.50.20250225 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969761898 From dholmes at openjdk.org Tue Feb 25 13:19:16 2025 From: dholmes at openjdk.org (David Holmes) Date: Tue, 25 Feb 2025 13:19:16 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. 
When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. > > However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... 
src/hotspot/share/runtime/objectMonitor.cpp line 704: > 702: > 703: for (;;) { > 704: ObjectWaiter* front = Atomic::load_acquire(&_entry_list); Technically you don't need a load_acquire here because you do not access any members of front before hitting the cmpxchg that gives you a full fence.. For good code hygiene Atomic::load would suffice. src/hotspot/share/runtime/objectMonitor.cpp line 723: > 721: > 722: for (;;) { > 723: ObjectWaiter* front = Atomic::load_acquire(&_entry_list); Technically you don't need a `load_acquire` here because you do not access any members of `front` before hitting the cmpxchg that gives you a full fence.. For good code hygiene `Atomic::load` would suffice. src/hotspot/share/runtime/objectMonitor.cpp line 1264: > 1262: return w; > 1263: } > 1264: w = Atomic::load_acquire(&_entry_list); Suggestion: // Need acquire here to match the implicit release of the cmpxchg that updated _entry_list, so we // can access w->_next. w = Atomic::load_acquire(&_entry_list); src/hotspot/share/runtime/objectMonitor.cpp line 1303: > 1301: // Check if we are unlinking the last element in the _entry_list. > 1302: // This is by far the most common case. > 1303: if (currentNode->_next == nullptr) { The direct checks of `_next` and _prev` for null/non-null do not work with your use of `set_bad_pointers`. If you actually intend to keep `set_bad_pointers` in the final code then you should be using accessors e.g. ObjectWaiter* next() { assert (_next != 0xBAD, "corrupted list!"); return _next; } src/hotspot/share/runtime/objectMonitor.cpp line 1306: > 1304: assert(_entry_list_tail == nullptr || _entry_list_tail == currentNode, "invariant"); > 1305: > 1306: ObjectWaiter* v = Atomic::load_acquire(&_entry_list); Again technically you do not need `load_acquire` here because you do not access any fields of `v` when `v` could be other than the current node. `Atomic::load` will suffice. 
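[Editor's note] The acquire-vs-plain-load rule in the review comments above can be illustrated with standard C++ atomics. This is a hypothetical sketch: `std::atomic` stands in for HotSpot's Atomic class, and the names (`publish`, `read_payload`, `head_is`) are invented for illustration.

```cpp
#include <atomic>
#include <cassert>

// Sketch of when an acquire load matters: only when the loaded pointer is
// dereferenced. A plain/relaxed load suffices when the value is merely
// compared before a CAS that fences anyway.
struct Node { int payload; Node* next; };

std::atomic<Node*> list_head{nullptr};

// Publisher: initialize the node, then release-store the pointer so the
// initialization is visible to an acquire-loading reader.
void publish(Node* n) {
  n->payload = 42;
  list_head.store(n, std::memory_order_release);
}

// Consumer that dereferences: must pair with the release via acquire.
int read_payload() {
  Node* n = list_head.load(std::memory_order_acquire);
  return n != nullptr ? n->payload : -1;
}

// Consumer that only compares the pointer (e.g. before a cmpxchg):
// a relaxed load is enough, mirroring "Atomic::load would suffice".
bool head_is(Node* expected) {
  return list_head.load(std::memory_order_relaxed) == expected;
}
```

The design point of the review is code hygiene: using the weaker load documents that no field of the loaded object is accessed on that path.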
src/hotspot/share/runtime/objectMonitor.cpp line 1315: > 1313: } > 1314: // The CAS above can fail from interference IFF a contending > 1315: // thread "pushed" itself onto entry_list. Suggestion: // The CAS above can fail from interference IFF a contending // thread "pushed" itself onto entry_list. So fall-through to // building the doubly-linked list. assert(currentNode->prev == nullptr, "invariant"); src/hotspot/share/runtime/objectMonitor.cpp line 1334: > 1332: } > 1333: > 1334: assert(currentNode->_next != nullptr, "invariant"); Suggestion: else { // currentNode->_next != nullptr // If we get here it means the current thread enqueued itself on the EntryList but was then able to // "steal" the lock before the chosen successor was able to. Consequently currentNode must be an // interior node in the EntryList, or the head. src/hotspot/share/runtime/objectMonitor.cpp line 1337: > 1335: assert(currentNode != _entry_list_tail, "invariant"); > 1336: > 1337: if (currentNode->_prev == nullptr) { Suggestion: // Check if we are in the singly-linked portion of the EntryList. If we are the head then we try to remove // ourselves, else we convert to the doubly-linked list. if (currentNode->_prev == nullptr) { src/hotspot/share/runtime/objectMonitor.cpp line 1347: > 1345: // else we convert to the doubly-linked list. > 1346: if (currentNode->_prev == nullptr) { > 1347: ObjectWaiter* v = Atomic::load_acquire(&_entry_list); Again no `load_acquire` needed. src/hotspot/share/runtime/objectMonitor.cpp line 1352: > 1350: // The CAS above can fail from interference IFF a contending > 1351: // thread "pushed" itself onto entry_list, in which case > 1352: // currentNode must now be in the interior of the list. Suggestion: // currentNode must now be in the interior of the list. Fall-through // to building the doubly-linked list. 
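[Editor's note] The unlink cases walked through in the comments above (the tail, the head of the singly-linked portion, an interior node) can be sketched single-threaded. This is a hypothetical model, not the HotSpot code — in particular it omits the cmpxchg needed to remove the head under contention and assumes prev pointers are already assigned.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical single-threaded sketch of the three unlink cases; field
// names mirror the PR text but this is not the HotSpot implementation.
struct W { W* _next = nullptr; W* _prev = nullptr; };

struct List {
  W* _entry_list = nullptr;       // head: most recently pushed
  W* _entry_list_tail = nullptr;  // cached tail: FIFO successor

  void unlink(W* n) {
    if (n->_next == nullptr) {            // case 1: n is the tail
      _entry_list_tail = n->_prev;
      if (n->_prev == nullptr) {
        _entry_list = nullptr;            // n was also the head (last node)
      } else {
        n->_prev->_next = nullptr;
      }
    } else if (n->_prev == nullptr) {     // case 2: head of the list
      _entry_list = n->_next;
      n->_next->_prev = nullptr;
    } else {                              // case 3: interior node
      n->_prev->_next = n->_next;
      n->_next->_prev = n->_prev;
    }
  }
};
```

Case 1 is the "by far the most common case" called out in the code under review, since the successor is chosen from the tail.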
src/hotspot/share/runtime/objectMonitor.cpp line 1353: > 1351: // thread "pushed" itself onto entry_list, in which case > 1352: // currentNode must now be in the interior of the list. > 1353: assert(_entry_list != currentNode, "invariant"); Not sure you really need this. The fact the cmpxchg failed means we can't be the head of the list. Also by reading it again you are potentially finding a different head to that which existed when the cmpxchg failed. src/hotspot/share/runtime/objectMonitor.cpp line 1362: > 1360: } > 1361: > 1362: // We now assume we are unlinking currentNode from the interior of a Suggestion: // We now know we are unlinking currentNode from the interior of a src/hotspot/share/runtime/objectMonitor.cpp line 1534: > 1532: ObjectWaiter* w = nullptr; > 1533: > 1534: w = _entry_list; Use `Atomic::load` for consistency and good code hygiene. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962360900 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962359972 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962364788 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957707916 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962368696 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957692735 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957696030 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957698728 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962370002 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957699877 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957701253 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957701596 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962372883 From fbredberg at openjdk.org Tue Feb 25 13:19:11 2025 From: 
fbredberg at openjdk.org (Fredrik Bredberg) Date: Tue, 25 Feb 2025 13:19:11 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 19:17:24 GMT, Coleen Phillimore wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. 
>> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 332: > >> 330: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ >> 331: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ >> 332: volatile_nonstatic_field(ObjectMonitor, _EntryListTail, ObjectWaiter*) \ > > You may need to coordinate with @mur47x111 to see what graal does with this field. I suspect the graal code also checks both ctx and EntryList in the unlock fast path and now only needs to check _EntryList. In which case we don't need to export EntryListTail. Thanks for the heads up @coleenp . I was planing on contacting the Graal team when this PR gets closer to getting integrated. I'll delete the `_EntryListTail` export, and make sure to ask for a review from @mur47x111 when that time comes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1949002357 From yzheng at openjdk.org Tue Feb 25 13:19:11 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 25 Feb 2025 13:19:11 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 19:17:24 GMT, Coleen Phillimore wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. 
The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. 
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 332: > >> 330: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ >> 331: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ >> 332: volatile_nonstatic_field(ObjectMonitor, _EntryListTail, ObjectWaiter*) \ > > You may need to coordinate with @mur47x111 to see what graal does with this field. I suspect the graal code also checks both ctx and EntryList in the unlock fast path and now only needs to check _EntryList. In which case we don't need to export EntryListTail. Indeed. You may delete this export and I will make the Graal side changes accordingly at [MonitorSnippets.java#L680](https://github.com/oracle/graal/blob/3d543641b056fdaa8e7444f09615067f8d766f6e/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/replacements/MonitorSnippets.java#L680) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1948809912 From fbredberg at openjdk.org Tue Feb 25 13:19:16 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Tue, 25 Feb 2025 13:19:16 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 20:55:28 GMT, David Holmes wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. 
>> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. 
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. >> >> However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > src/hotspot/share/runtime/objectMonitor.cpp line 704: > >> 702: >> 703: for (;;) { >> 704: ObjectWaiter* front = Atomic::load_acquire(&_entry_list); > > Technically you don't need a load_acquire here because you do not access any members of front before hitting the cmpxchg that gives you a full fence. For good code hygiene Atomic::load would suffice. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 723: > >> 721: >> 722: for (;;) { >> 723: ObjectWaiter* front = Atomic::load_acquire(&_entry_list); > > Technically you don't need a `load_acquire` here because you do not access any members of `front` before hitting the cmpxchg that gives you a full fence. For good code hygiene `Atomic::load` would suffice. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1264: > >> 1262: return w; >> 1263: } >> 1264: w = Atomic::load_acquire(&_entry_list); > > Suggestion: > > // Need acquire here to match the implicit release of the cmpxchg that updated _entry_list, so we > // can access w->_next. > w = Atomic::load_acquire(&_entry_list); Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1303: > >> 1301: // Check if we are unlinking the last element in the _entry_list. >> 1302: // This is by far the most common case. >> 1303: if (currentNode->_next == nullptr) { > > The direct checks of `_next` and `_prev` for null/non-null do not work with your use of `set_bad_pointers`. If you actually intend to keep `set_bad_pointers` in the final code then you should be using accessors e.g.
> > ObjectWaiter* next() { > assert (_next != 0xBAD, "corrupted list!"); > return _next; > } Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1306: > >> 1304: assert(_entry_list_tail == nullptr || _entry_list_tail == currentNode, "invariant"); >> 1305: >> 1306: ObjectWaiter* v = Atomic::load_acquire(&_entry_list); > > Again technically you do not need `load_acquire` here because you do not access any fields of `v` when `v` could be other than the current node. `Atomic::load` will suffice. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1315: > >> 1313: } >> 1314: // The CAS above can fail from interference IFF a contending >> 1315: // thread "pushed" itself onto entry_list. > > Suggestion: > > // The CAS above can fail from interference IFF a contending > // thread "pushed" itself onto entry_list. So fall-through to > // building the doubly-linked list. > assert(currentNode->prev == nullptr, "invariant"); Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1334: > >> 1332: } >> 1333: >> 1334: assert(currentNode->_next != nullptr, "invariant"); > > Suggestion: > > else { // currentNode->_next != nullptr > > // If we get here it means the current thread enqueued itself on the EntryList but was then able to > // "steal" the lock before the chosen successor was able to. Consequently currentNode must be an > // interior node in the EntryList, or the head. Added the comment but left out the suggested "else" and kept the assert. I know that the if statement above always ends in a return, but if that is changed this feels safer. > src/hotspot/share/runtime/objectMonitor.cpp line 1337: > >> 1335: assert(currentNode != _entry_list_tail, "invariant"); >> 1336: >> 1337: if (currentNode->_prev == nullptr) { > > Suggestion: > > // Check if we are in the singly-linked portion of the EntryList. If we are the head then we try to remove > // ourselves, else we convert to the doubly-linked list. 
> if (currentNode->_prev == nullptr) { Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1347: > >> 1345: // else we convert to the doubly-linked list. >> 1346: if (currentNode->_prev == nullptr) { >> 1347: ObjectWaiter* v = Atomic::load_acquire(&_entry_list); > > Again no `load_acquire` needed. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1352: > >> 1350: // The CAS above can fail from interference IFF a contending >> 1351: // thread "pushed" itself onto entry_list, in which case >> 1352: // currentNode must now be in the interior of the list. > > Suggestion: > > // currentNode must now be in the interior of the list. Fall-through > // to building the doubly-linked list. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1353: > >> 1351: // thread "pushed" itself onto entry_list, in which case >> 1352: // currentNode must now be in the interior of the list. >> 1353: assert(_entry_list != currentNode, "invariant"); > > Not sure you really need this. The fact that the cmpxchg failed means we can't be the head of the list. Also by reading it again you are potentially finding a different head to that which existed when the cmpxchg failed. You are right I don't really need it, but sometimes I feel that comments can rot, but asserts can't. I guess I put this one in so that it's easier to see what state the currentNode is in (not head) without reading through the logic that ends up in the else-statement. > src/hotspot/share/runtime/objectMonitor.cpp line 1362: > >> 1360: } >> 1361: >> 1362: // We now assume we are unlinking currentNode from the interior of a > > Suggestion: > > // We now know we are unlinking currentNode from the interior of a Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1534: > >> 1532: ObjectWaiter* w = nullptr; >> 1533: >> 1534: w = _entry_list; > > Use `Atomic::load` for consistency and good code hygiene.
Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963144747 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963135003 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963050591 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1967825628 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963136473 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963132242 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1961646077 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1961647021 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963137807 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1969341568 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1961659147 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963133824 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963141844 From aph at openjdk.org Tue Feb 25 13:40:57 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:40:57 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 12:09:57 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. test/hotspot/gtest/aarch64/aarch64-asmtest.py line 19: > 17: 0x7e0, 0xfc0, 0x1f80, 0x3ff0, 0x7e00, 0x8000, > 18: 0x81ff, 0xc1ff, 0xc003, 0xc7ff, 0xdfff, 0xe03f, > 19: 0xe1ff, 0xf801, 0xfc00, 0xfc07, 0xff03, 0xfffe] So here you've deleted the duplicated `0x7e00` (good) but also the not-duplicated `0xe10f`. Is `0xe10f` not valid? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1969800950 From aph at openjdk.org Tue Feb 25 13:46:57 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:46:57 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 12:09:57 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. Overall, this looks like a great piece of work. I only have a few changes in comments and a question, then we're good to go. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23748#issuecomment-2682030036 From aph at openjdk.org Tue Feb 25 13:52:55 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:52:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 25 Feb 2025 13:15:49 GMT, Andrew Haley wrote: >>> I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I > haven't read through all of the 14568 pages. >> >> Yes, you've got a point, but it's always worked. Is this a macos thing, maybe? >> >>> So I'm stuck for now. What 'as' are you using? >> >> Latest binutils, today. I checked it out half an hour ago. >> >> GNU assembler (GNU Binutils) 2.44.50.20250225 >> Copyright (C) 2025 Free Software Foundation, Inc.
>> >> Try this: >> >> >> diff --git a/test/hotspot/gtest/aarch64/aarch64-asmtest.py b/test/hotspot/gtest/aarch64/aarch64-asmtest.py >> index 9c770632e25..b1674fff04d 100644 >> --- a/test/hotspot/gtest/aarch64/aarch64-asmtest.py >> +++ b/test/hotspot/gtest/aarch64/aarch64-asmtest.py >> @@ -476,8 +476,13 @@ class AddSubExtendedOp(ThreeRegInstruction): >> + ", " + str(self.amount) + ");")) >> >> def astr(self): >> - return (super(AddSubExtendedOp, self).astr() >> - + (", " + AddSubExtendedOp.optNames[self.option] >> + prefix = self.asmRegPrefix >> + return (super(ThreeRegInstruction, self).astr() >> + + ('%s, %s, %s' >> + % (self.reg[0].astr(prefix), >> + self.reg[1].astr(prefix), >> + self.reg[1].astr("w")) >> + + ", " + AddSubExtendedOp.optNames[self.option] >> + " #" + str(self.amount))) >> >> class AddSubImmOp(TwoRegImmedInstruction): > > I just tried it with top-of trunk latest binutils: > > fedora:aarch64 $ ~/binutils-gdb-install/bin/as -march=armv9-a+sha3+sve2-bitperm aarch64ops.s > fedora:aarch64 $ ~/binutils-gdb-install/bin/as --version > GNU assembler (GNU Binutils) 2.44.50.20250225 Aha! aph at Andrews-MacBook-Pro ~ % as t.s t.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] sub x1, x10, x23, sxth #2 ^ aph at Andrews-MacBook-Pro ~ % as --version Apple clang version 16.0.0 (clang-1600.0.26.6) Target: arm64-apple-darwin24.3.0 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969823700 From bkilambi at openjdk.org Tue Feb 25 13:55:58 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 25 Feb 2025 13:55:58 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 13:37:51 GMT, Andrew Haley wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. 
> > test/hotspot/gtest/aarch64/aarch64-asmtest.py line 19: > >> 17: 0x7e0, 0xfc0, 0x1f80, 0x3ff0, 0x7e00, 0x8000, >> 18: 0x81ff, 0xc1ff, 0xc003, 0xc7ff, 0xdfff, 0xe03f, >> 19: 0xe1ff, 0xf801, 0xfc00, 0xfc07, 0xff03, 0xfffe] > > So here you've deleted the duplicated `0x7e00` (good) but also the not-duplicated `0xe10f`. Is `0xe10f` not valid? Hi, yes `0xe10f` does not seem to be valid. While I tried generating the `asmtest.out.h` I ran into errors with this value - aarch64ops.s:1105: Error: immediate out of range at operand 3 -- eor z6.h,z6.h,#0xe10f aarch64ops.s:1123: Error: immediate out of range at operand 3 -- eor z3.h,z3.h,#0xe10f So I looked it up here - https://gist.github.com/dinfuehr/51a01ac58c0b23e4de9aac313ed6a06a to see if this number is a legal immediate and it looks like it isn't. Maybe it's just chance that this number wasn't generated before as an immediate operand and these errors didn't show up until now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1969827032 From galder at openjdk.org Tue Feb 25 14:57:05 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Tue, 25 Feb 2025 14:57:05 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the Java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ?
a : b; >> } >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... and 34 more: https://git.openjdk.org/jdk/compare/d6aa3453...a190ae68 > > The interesting thing is intReductionSimpleMin @ 100%. We see a regression there but I didn't observe it with the perfasm run. So, this could be due to variance in the application of cmov or not? > > I don't see the error / variance in the results you posted. Often I look at those, and if it is anywhere above 10% of the average, then I'm suspicious ;) > > Re: [#20098 (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) - I was trying to think what could be causing this. > > Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example? @eme64 I think you're in the right direction: minLongA = negate(maxLongA); minLongB = negate(maxLongB); minIntA = toInts(minLongA); minIntB = toInts(minLongB); To keep the same data distribution algorithm for both min and max operations, I started with positive numbers for max and found out that I could use the same data with the same properties for min by negating them. As you can see in the above snippet, the min values for ints had not been negated.
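For reference, the negation trick discussed above relies on the identity `min(a, b) == -max(-a, -b)`, which holds for any values whose negation does not overflow (i.e. excluding `Long.MIN_VALUE` in two's complement). A hypothetical standalone sketch, not the benchmark code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// min(a, b) == -max(-a, -b) whenever -a and -b do not overflow,
// so a data set exercising max can simply be negated to exercise min
// with the same distribution properties.
int64_t min_via_max(int64_t a, int64_t b) {
  return -std::max(-a, -b);
}
```

This is why the int min data also has to be negated: taking the ints from the un-negated max data changes the distribution.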
I'll fix that and show final numbers with the same subset shown in https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644 ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2682263423 From tschatzl at openjdk.org Tue Feb 25 15:04:28 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 25 Feb 2025 15:04:28 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier Message-ID: Hi all, please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. ### Current situation With this change, G1 will reduce the post-write barrier so that it much more resembles Parallel GC's, as described in the JEP. The reason is that G1 lacks throughput compared to Parallel/Serial GC due to its larger barrier. The main reason for the current barrier is how G1 implements concurrent refinement: * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads, * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo-code: // Filtering if (region(@x.a) == region(y)) goto done; // same region check if (y == null) goto done; // null value check if (card(@x.a) == young_card) goto done; // write to young gen check StoreLoad; // synchronize if (card(@x.a) == dirty_card) goto done; *card(@x.a) = dirty // Card tracking enqueue(card-address(@x.a)) into thread-local-dcq; if (thread-local-dcq is not full) goto done; call runtime to move thread-local-dcq into dcqs done: Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for Parallel and Serial GC. The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a second card table ("refinement table"). The second card table also replaces the dirty card queue. In that scheme the fine-grained synchronization is unnecessary because mutator and refinement threads always write to different memory areas (so no concurrent write that could lose an update can occur). This removes the necessity for synchronization for every reference write. Also, no card enqueuing is required anymore. Only the filters and the card mark remain. ### How this works In the beginning both the card table and the refinement table are completely unmarked (contain "clean" cards).
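As a rough standalone illustration of the reduced mutator write path (the filters plus a plain card mark, with no StoreLoad and no enqueue), here is a sketch. All constants, field names, and the layout are made up for illustration; the same-region filter is omitted because it needs region arithmetic not modeled here:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model of a card-marking post-write barrier.
constexpr int kCardShift = 9;          // 512-byte cards
constexpr uint8_t kCleanCard = 0xff;
constexpr uint8_t kDirtyCard = 0x00;

struct Heap {
  uintptr_t base;        // start of the heap
  uint8_t* card_table;   // per-thread pointer in the real design,
                         // so the tables can be swapped atomically
};

// Post-barrier for x.a = y: two cheap filters, then a plain store.
// No StoreLoad fence and no enqueue into a dirty card queue.
inline void post_write_barrier(Heap& h, uintptr_t field_addr, uintptr_t y) {
  if (y == 0) return;                            // null value check
  size_t idx = (field_addr - h.base) >> kCardShift;
  if (h.card_table[idx] != kCleanCard) return;   // skip already non-clean cards
  h.card_table[idx] = kDirtyCard;                // unconditional card mark
}
```

The non-clean check doubles as the mechanism that preserves special card values (such as the to-collection-set marking described later), since the mutator never overwrites a card that is not clean.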
The mutator dirties the card table, until G1 heuristics think that a significant enough number of cards has been dirtied, based on what is allocated for scanning them during the garbage collection. At that point, the card table and the refinement table are exchanged "atomically" using handshakes. The mutator keeps dirtying the card table (the previously clean refinement table, which is now the card table), while the refinement threads look for and refine dirty cards on the refinement table as before. Refinement of cards is very similar to before: if an interesting reference in a dirty card has been found, G1 records it in appropriate remembered sets. In this implementation there is an exception for references to the current collection set (typically young gen) - the refinement threads redirty that card on the card table with a special `to-collection-set` value. This is valid because races with the mutator for that write do not matter - the entire card will eventually be rescanned anyway, regardless of whether it ends up as dirty or to-collection-set. The advantage of marking to-collection-set cards specially is that the next time the card tables are swapped, the refinement threads will not re-refine them, on the assumption that that reference to the collection set will not change. This decreases refinement work substantially. If refinement gets interrupted by GC, the refinement table will be merged with the card table before card scanning, which works as before. New barrier pseudo-code for an assignment `x.a = y`: // Filtering if (region(@x.a) == region(y)) goto done; // same region check if (y == null) goto done; // null value check if (card(@x.a) != clean_card) goto done; // skip already non-clean cards *card(@x.a) = dirty This is basically the Serial/Parallel GC barrier with additional filters to keep the number of dirty cards as small as possible. A few more comments about the barrier: * the barrier now loads the card table base offset from a thread local instead of inlining it.
This is necessary for this mechanism to work, as the card table to be dirtied changes over time, and it may even be faster on some architectures (code size); some architectures already do this. * all existing pre-filters were kept. Benchmarks showed some significant regressions with respect to pause times and even throughput compared to G1 in master. Using the Parallel GC barrier (just the dirty card write) would be possible, and further investigation on stripping parts will be made as follow-up. * the final check tests for non-clean cards to avoid overwriting existing cards, in particular the "to-collection set" cards described above. Current G1 marks the cards corresponding to young gen regions as all "young" so that the original barrier could potentially avoid the `StoreLoad`. This implementation removes this facility (which might be re-introduced later), but measurements showed that pre-dirtying the young generation region's cards as "dirty" (G1 does not need to use an extra "young" value) did not yield any measurable performance difference. ### Refinement process The goal of the refinement (threads) is to make sure that the number of cards to scan in the garbage collection is below a particular threshold. The prototype changes the refinement threads into a single control thread and a set of (refinement) worker threads. Unlike the previous implementation, the control thread does not do any refinement, but only executes the heuristics to start a calculated number of worker threads and tracks refinement progress. The refinement trigger is based on the currently known number of pending (i.e. dirty) cards on the card table and a pending card generation rate, fairly similarly to the previous algorithm. After the refinement control thread determines that it is time to do refinement, it starts the following sequence: 1) **Swap the card table**.
This consists of several steps: 1) **Swap the global card table** - the global card table pointer is swapped; newly created threads and runtime calls will eventually use the new values, at the latest after the next two steps. 2) **Update the pointers in all JavaThread**'s TLS storage to the new card table pointer using a handshake operation 3) **Update the pointers in the GC thread**'s TLS storage to the new card table pointer using the SuspendibleThreadSet mechanism 2) **Snapshot the heap** - determine the extent of work needed for all regions where the refinement threads need to do some work on the refinement table (the previous card table). The snapshot stores the work progress for each region so that work can be interrupted and continued at any time. This work either consists of refinement of the particular card (old generation regions) or clearing the cards (next collection set/young generation regions). 3) **Sweep the refinement table** by activating the refinement worker threads. The threads refine dirty cards using the heap snapshot where worker threads claim parts of regions to process. * Cards with references to the young generation are not added to the young generation's card-based remembered set. Instead these cards are marked as to-collection-set in the card table and any remaining refinement of that card skipped. * If refinement encounters a card that is already marked as to-collection-set it is not refined but re-marked as to-collection-set on the card table. * During refinement, the refinement table is also cleared (in bulk for collection set regions as they do not need any refinement, and in other regions as they are refined for the non-clean cards). * Dirty cards within unparsable heap areas are forwarded to/redirtied on the card table as is. 4) **Completion work**, mostly statistics. If the work is interrupted by a non-garbage collection synchronization point, work is suspended temporarily and resumed later using the heap snapshot.
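The card-table swap described above can be modeled as follows. This is an illustrative sketch only: the names are made up, and a plain loop over threads stands in for the handshake and SuspendibleThreadSet steps:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Each mutator caches the current card table pointer in thread-local
// storage; the swap updates the global pair and then publishes the new
// primary to every thread.
struct MutatorThread {
  std::atomic<uint8_t*> tls_card_table{nullptr};
};

struct CardTables {
  uint8_t* primary;      // mutators dirty this one
  uint8_t* refinement;   // refinement workers sweep this one
};

void swap_card_tables(CardTables& ct, std::vector<MutatorThread*>& threads) {
  std::swap(ct.primary, ct.refinement);  // swap the global pointers
  for (MutatorThread* t : threads) {     // publish to each thread (the
    t->tls_card_table.store(ct.primary,  // handshake stand-in)
                            std::memory_order_release);
  }
  // Mutators now dirty the previously clean table while refinement
  // sweeps the other one; the two never write to the same table, so no
  // per-store synchronization is required.
}
```

The key property this models is that after the swap completes, mutator and refinement threads operate on disjoint tables.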
After the refinement process the refinement table is all clean again and ready to be swapped again. ### Garbage collection pause changes Since a garbage collection (young or full GC) pause may occur at any point during the refinement process, the garbage collection needs some compensating work for the not yet swept parts of the refinement table. Note that this situation is very rare, and the heuristics try to avoid that, so in most cases nothing needs to be done as the refinement table is all clean. If this happens, young collections add a new phase called `Merge Refinement Table` in the garbage collection pause right before the `Merge Heap Roots` phase. This compensating phase does the following: 0) (Optional) Snapshot the heap if not done yet (if the process has been interrupted between steps 1 and 3 of the refinement process) 1) Merge the refinement table into the card table - in this step the dirty cards of interesting regions are merged into the card table 2) Completion work (statistics) If a full collection interrupts concurrent refinement, the refinement table is simply cleared and all dirty cards thrown away. A garbage collection generates new cards (e.g. references from promoted objects into the young generation) on the refinement table. This acts similarly to the extra DCQS used to record these interesting references/cards and redirty the card table using them in the previous implementation. G1 swaps the card tables at the end of the collection to keep the post-condition of the refinement table being all clean (and any to-be-refined cards on the card table) at the end of garbage collection. ### Performance metrics Following is an overview of the changes in behavior. Some numbers are provided in the CR in the first comment. #### Native memory usage The refinement table takes an additional 0.2% of the Java heap size of native memory compared to JDK 21 and above (in JDK 21 we removed one card table sized data structure, so this is a non-issue when updating from before).
Some of that additional memory usage is automatically reclaimed by removing the dirty card queues. Additional memory is reclaimed by managing the cards containing to-collection-set references on the card table, by dropping the explicit remembered sets for the young generation completely and any remembered set entries which would otherwise be duplicated into the other region's remembered sets. In some applications/benchmarks these gains completely offset the additional card table; however, most of the time this is not the case, particularly for throughput applications currently. It is possible to allocate the refinement table lazily, which means that since these applications often do not need any concurrent refinement, there is no overhead at all but actually a net reduction of native memory usage. This is not implemented in this prototype. #### Latency ("Pause times") Not affected or slightly better. Pause times decrease due to a shorter "Merge remembered sets" phase due to no work required for the remembered sets for the young generation - they are always already on the card table! However merging of the refinement table into the card table is extremely fast and is always faster than merging remembered sets for the young gen in my measurements. Since this work is linearly scanning some memory, this is embarrassingly parallel too. The cards created during garbage collection do not need to be redirtied, so that phase has also been removed. The card table swap is based on predictions for mutator card dirtying rate and refinement rate as before, and the policy is actually fairly similar to before. It is still rather aggressive, but in most cases takes fewer CPU resources than the one before, mostly because refining takes less CPU time. Many applications do not do any refinement at all, like before. More investigation could be done to improve this in the future.
#### Throughput This change always increases throughput in my measurements; depending on benchmark/application it may not actually show up in scores though. Due to the pre-barrier and the additional filters in the barrier, G1 is still slower than Parallel on raw throughput benchmarks, but is typically somewhere half-way to Parallel GC or closer. ### Platform support Since the post-write barrier changed, additional work for some platforms is required to allow this change to proceed. At this time all work for all platforms is done, but needs testing - GraalVM (contributed by the GraalVM team) - S390 (contributed by A. Kumar from IBM) - PPC (contributed by M. Doerr, from SAP) - ARM (should work, HelloWorld compiles and runs) - RISCV (should work, HelloWorld compiles and runs) - x86 (should work, build/HelloWorld compiles and runs) None of the above-mentioned platforms implement the barrier method to write cards for a reference array (aarch64 and x64 are fully implemented); they call the runtime as before. I believe it is doable fairly easily now with this simplified barrier for some extra performance, but not necessary. ### Alternatives The JEP text extensively discusses alternatives. ### Reviewing The change can be roughly divided into these fairly isolated parts * platform-specific changes to the barrier * refinement and refinement control thread changes; this is best reviewed starting from the `G1ConcurrentRefineThread::run_service` method * changes to garbage collection: `merge_refinement_table()` in `g1RemSet.cpp` * policy modifications are typically related to code around the calls to `G1Policy::record_dirtying_stats`. Further information is available in the [JEP draft](https://bugs.openjdk.org/browse/JDK-8340827); there is also a somewhat more extensive discussion of the change on my [blog](https://tschatzl.github.io/2025/02/21/new-write-barriers.html). Some additional comments: * the pre-marking of young generation cards has been removed.
Benchmarks did not show any significant difference either way. To me this makes some sense, because the entire young gen will quickly get marked anyway; i.e. one only saves a single additional card table write (for every card). With the old barrier the cost of a card table mark was much higher.
* G1 sets `UseCondCardMark` to true by default. The conditional card mark corresponds to the third filter in the write barrier now, and since I decided to keep all filters for this change, it makes sense to directly use this mechanism.

If there are any questions, feel free to ask.

Testing: tier1-7 (multiple tier1-7, tier1-8 with slightly older versions)

Thanks,
Thomas

------------- Commit messages:
- * only provide byte map base for JavaThreads
- * mdoerr review: fix comments in ppc code
- * fix crash when writing dirty cards for memory regions during card table switching
- * remove mention of "enqueue" or "enqueuing" for actions related to post barrier
- * remove some commented out debug code
- Card table as DCQ

Changes: https://git.openjdk.org/jdk/pull/23739/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8342382
Stats: 6543 lines in 103 files changed: 2162 ins; 3461 del; 920 mod
Patch: https://git.openjdk.org/jdk/pull/23739.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739
PR: https://git.openjdk.org/jdk/pull/23739

From mdoerr at openjdk.org Tue Feb 25 15:04:29 2025
From: mdoerr at openjdk.org (Martin Doerr)
Date: Tue, 25 Feb 2025 15:04:29 GMT
Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier
In-Reply-To: References: Message-ID: 

On Sun, 23 Feb 2025 18:53:33 GMT, Thomas Schatzl wrote:

> Hi all,
>
> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. 
> > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... PPC64 code looks great! Thanks for doing this! Only some comments are no longer correct. src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp line 244: > 242: > 243: __ xorr(R0, store_addr, new_val); // tmp1 := store address ^ new value > 244: __ srdi_(R0, R0, G1HeapRegion::LogOfHRGrainBytes); // tmp1 := ((store address ^ new value) >> LogOfHRGrainBytes) Comment: R0 is used instead of tmp1 src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp line 259: > 257: > 258: __ ld(tmp1, G1ThreadLocalData::card_table_base_offset(), thread); > 259: __ srdi(tmp2, store_addr, CardTable::card_shift()); // tmp1 := card address relative to card table base Comment: tmp2 is used, here src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp line 261: > 259: __ srdi(tmp2, store_addr, CardTable::card_shift()); // tmp1 := card address relative to card table base > 260: if (UseCondCardMark) { > 261: __ lbzx(R0, tmp1, tmp2); // tmp1 := card address Can you remove the comment, please? It's wrong. 
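The `xorr`/`srdi_` pair in the quoted PPC code implements the same-region filter from the barrier pseudo code: XORing the store address with the new value and shifting right by the log of the region size yields zero exactly when both addresses fall into the same heap region. A small sketch of the computation (the region-size constant here is an illustrative assumption, not necessarily G1's actual `G1HeapRegion::LogOfHRGrainBytes`):

```cpp
#include <cstdint>

// Illustrative region size: 2 MB regions => log2 = 21. G1 chooses the real
// grain size based on heap size; this constant is an assumption for the sketch.
const unsigned kLogOfHRGrainBytes = 21;

// Same-region filter: the XOR cancels all common high bits, so shifting out
// the offset-within-region bits leaves zero iff both addresses share a region.
bool same_region(uintptr_t store_addr, uintptr_t new_val) {
  return ((store_addr ^ new_val) >> kLogOfHRGrainBytes) == 0;
}
```

This is why a single XOR plus a shift-and-test suffices in the assembly: no region lookup or division is needed, only bit arithmetic on the two addresses.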
------------- PR Review: https://git.openjdk.org/jdk/pull/23739#pullrequestreview-2637143540 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1967669777 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1967670850 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1967671593 From duke at openjdk.org Tue Feb 25 15:04:29 2025 From: duke at openjdk.org (Piotr Tarsa) Date: Tue, 25 Feb 2025 15:04:29 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier In-Reply-To: References: Message-ID: On Sun, 23 Feb 2025 18:53:33 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... 
In this PR you've written:

if (region(@x.a) != region(y)) goto done; // same region check

but on https://tschatzl.github.io/2025/02/21/new-write-barriers.html you wrote:

(1) if (region(x.a) == region(y)) goto done; // Ignore references within the same region/area

I guess the second one is correct.

------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2677075290

From stuefe at openjdk.org Tue Feb 25 15:04:29 2025
From: stuefe at openjdk.org (Thomas Stuefe)
Date: Tue, 25 Feb 2025 15:04:29 GMT
Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier
In-Reply-To: References: Message-ID: 

On Sun, 23 Feb 2025 18:53:33 GMT, Thomas Schatzl wrote:

> Hi all,
>
> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>
> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25.
>
> ### Current situation
>
> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier.
>
> The main reason for the current barrier is how g1 implements concurrent refinement:
> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads,
> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... @tschatzl I did not contribute the ppc port. Did you mean @TheRealMDoerr or @reinrich ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2677512780 From tschatzl at openjdk.org Tue Feb 25 15:13:43 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 25 Feb 2025 15:13:43 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. 
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. 
> > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * remove unnecessarily added logging ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/0100d8e2..9ef9c5f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=00-01 Stats: 4 lines in 4 files changed: 0 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Tue Feb 25 16:00:57 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 25 Feb 2025 16:00:57 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: <_CekdxBJviS_sZCVN62_yFx-cTF4qrIuAnqbIeUmFck=.3a6afffb-8fbe-4809-a4ca-1bc22b52a628@github.com> On Tue, 25 Feb 2025 13:50:35 GMT, Andrew Haley wrote: >> I just tried it with top-of trunk latest binutils: >> >> fedora:aarch64 $ ~/binutils-gdb-install/bin/as -march=armv9-a+sha3+sve2-bitperm aarch64ops.s >> fedora:aarch64 $ ~/binutils-gdb-install/bin/as --version >> GNU assembler (GNU Binutils) 2.44.50.20250225 > > Aha! 
> > > aph at Andrews-MacBook-Pro ~ % as t.s > t.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > ^ > aph at Andrews-MacBook-Pro ~ % as --version > Apple clang version 16.0.0 (clang-1600.0.26.6) > Target: arm64-apple-darwin24.3.0 OK, so GNU as is more forgiving than Apple as... ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1970076152 From kvn at openjdk.org Tue Feb 25 17:32:02 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Feb 2025 17:32:02 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 09:27:13 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
>> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits: > > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - stall -> delay, plus some more comments > - adjust selector if probability > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - remove multiversion mark if we break the structure > - register opaque with igvn > - copyright and rm CFG check > - IR rules for all cases > - 3 test versions > - test changed to unaligned ints > - ... and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 This looks good for me. 
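The fast/slow multiversioning discussed in this thread boils down to a runtime alignment check selecting between two copies of the same loop. A hand-written analogue (the alignment constant and the loop bodies are illustrative; in HotSpot the check and both loop versions are generated by C2, not written by hand):

```cpp
#include <cstddef>
#include <cstdint>

// Multiversioning sketch: one runtime check picks between a loop the compiler
// may vectorize under an alignment assumption and a fallback with identical
// semantics. The constant 8 stands in for ObjectAlignmentInBytes (assumption).
void increment_all(int32_t* base, size_t n) {
  const size_t assumed_alignment = 8;
  if (reinterpret_cast<uintptr_t>(base) % assumed_alignment == 0) {
    // "fast_loop": the alignment assumption holds, so vector loads/stores
    // derived from base would be aligned and vectorization is safe.
    for (size_t i = 0; i < n; i++) {
      base[i] += 1;
    }
  } else {
    // "slow_loop": same semantics, no alignment assumption, stays scalar
    // (or less vectorized) - slower but always correct.
    for (size_t i = 0; i < n; i++) {
      base[i] += 1;
    }
  }
}
```

Both versions compute the same result; only the optimization assumptions differ, which is why "slow" here means "less optimized" rather than "rarely taken".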
-------------

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2641927937

From kvn at openjdk.org Tue Feb 25 17:32:02 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 25 Feb 2025 17:32:02 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: <_pnjKfnS2e4hYWJ5_y8CudFAOmKB7FrD8cad8wCfZus=.16ac819a-2a99-4a8b-9640-3fa3bde53970@github.com>

On Tue, 25 Feb 2025 07:09:24 GMT, Emanuel Peter wrote:

> > PS: "slow" path implies that it is not taken frequently and it should not affect general performance of application.
>
> For me "slow" just means less optimized, because some assumption does not hold. The "fast" path is faster, because it has more assumptions and can optimize more (i.e. vectorize in this case, or vectorize more instructions). Do you have a better name than "fast/slow"?

I think I nit-picked here. I see your good comments in `loopTransform.cpp` and `loopnode.hpp` explaining multiversioning fast_loop/slow_loop. I think it is fine to keep "slow/fast". We can use "uncommon" to indicate the infrequent path.

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2682745643

From bkilambi at openjdk.org Tue Feb 25 19:45:31 2025
From: bkilambi at openjdk.org (Bhavana Kilambi)
Date: Tue, 25 Feb 2025 19:45:31 GMT
Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2]
In-Reply-To: References: Message-ID: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com>

> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max.
Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23748/files - new: https://git.openjdk.org/jdk/pull/23748/files/a608a035..4d699740 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23748&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23748&range=00-01 Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23748.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23748/head:pull/23748 PR: https://git.openjdk.org/jdk/pull/23748 From bkilambi at openjdk.org Tue Feb 25 19:49:01 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 25 Feb 2025 19:49:01 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 17:06:59 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Address review comments > > src/hotspot/cpu/aarch64/aarch64.ad line 17275: > >> 17273: >> 17274: // This pattern would result in the following instructions (the first two are for ConvF2HF >> 17275: // and the last instruction is for ReinterpretS2HF) - > > Suggestion: > > // Without this pattern, (ReinterpretS2HF (ConvF2HF src)) would result in the following instructions (the first two for ConvF2HF > // and the last instruction for ReinterpretS2HF) - > > Reads a little better, I think? Addressed this in the new patch. 
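For readers unfamiliar with the binary16 layout these ConvF2HF/ReinterpretS2HF patterns shuffle around: a half-precision value is a 16-bit pattern (1 sign bit, 5 exponent bits, 10 mantissa bits) that, when loaded as a 32-bit word as described above, occupies the low 16 bits with the upper half zero. A simplified conversion sketch for normal values only (no rounding, NaN, infinity or subnormal handling, unlike the real ConvF2HF instruction):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative float -> IEEE 754 binary16 bit conversion, valid only for
// normal half-precision-representable values; real hardware conversion
// (aarch64 fcvt) additionally handles rounding, NaN, infinity and subnormals.
uint16_t float_to_half_bits(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));          // reinterpret float as bits
  uint32_t sign = (bits >> 16) & 0x8000;         // move sign to bit 15
  int32_t  exp  = static_cast<int32_t>((bits >> 23) & 0xFF) - 127 + 15;  // rebias
  uint32_t mant = (bits >> 13) & 0x3FF;          // truncate mantissa, no rounding
  return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp) << 10) | mant);
}
```

Storing such a 16-bit result into a 32-bit register zero-fills the upper half, which matches the "bottom 16 bits contain the half-precision constant, the top 16 bits are zero" behavior discussed in the review.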
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1970437734 From bkilambi at openjdk.org Tue Feb 25 19:48:59 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 25 Feb 2025 19:48:59 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 17:42:05 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 6978: >> >>> 6976: // ldr instruction has 32/64/128 bit variants but not a 16-bit variant. This >>> 6977: // loads the 16-bit value from constant pool into a 32-bit register but only >>> 6978: // the bottom half will be populated. >> >> Surely what actually happens here is that it loads a 32-bit word from the constant pool. The bottom 16 bits of this word contain the half-precision constant, the top 16 bits are zero. > > I agree. The wording didn't quite convey that. I will change it in my next PS. Thank you for looking into the patch! Addressed this in the new patch. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1970437283 From mpowers at openjdk.org Wed Feb 26 01:03:52 2025 From: mpowers at openjdk.org (Mark Powers) Date: Wed, 26 Feb 2025 01:03:52 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 13:53:30 GMT, Ferenc Rakoczi wrote: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. 
ML-KEM benchmark results of this PR:

MLKEM.decapsulate  512   11.80 us/op
MLKEM.decapsulate  768   18.19 us/op
MLKEM.decapsulate 1024   29.57 us/op
MLKEM.encapsulate  512    8.80 us/op
MLKEM.encapsulate  768   13.49 us/op
MLKEM.encapsulate 1024   22.53 us/op
MLKEM.keygen       512    7.49 us/op
MLKEM.keygen       768   11.22 us/op
MLKEM.keygen      1024   19.08 us/op

ML-KEM no intrinsics:

MLKEM.decapsulate  512   31.23 us/op
MLKEM.decapsulate  768   50.09 us/op
MLKEM.decapsulate 1024   75.92 us/op
MLKEM.encapsulate  512   22.72 us/op
MLKEM.encapsulate  768   37.27 us/op
MLKEM.encapsulate 1024   59.69 us/op
MLKEM.keygen       512   17.95 us/op
MLKEM.keygen       768   30.95 us/op
MLKEM.keygen      1024   49.04 us/op

------------- PR Comment: https://git.openjdk.org/jdk/pull/23663#issuecomment-2683631601

From dholmes at openjdk.org Wed Feb 26 07:02:12 2025
From: dholmes at openjdk.org (David Holmes)
Date: Wed, 26 Feb 2025 07:02:12 GMT
Subject: RFR: 8343840: Rewrite the ObjectMonitor lists
In-Reply-To: References: Message-ID: 

On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote:

> I've combined two of `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list: the `entry_list`.
>
> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past.
>
> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of the `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks.
>
> The new list-design is as much a multi-queue as the current.
Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`.

> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable.
>
> The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` forms a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list.
>
> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we also assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor).
>
> Performance wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation.
>
> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b...

Disclaimer for other reviewers: I have been looking at this code for some time now. Overall code looks good. I have quite a few comments/suggestions about comments.

I suggest renaming `_vthread_cxq_head` to just `_vthread_head` as the `cxq` part is no longer meaningful.
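The CAS push and tail walk that the quoted description explains can be sketched as a toy model (names and structure are illustrative, not HotSpot's actual `ObjectWaiter`/`ObjectMonitor` code, and this omits the `entry_list_tail` caching):

```cpp
#include <atomic>

// Toy entry_list: arriving waiters CAS themselves onto the head, forming a
// singly linked list via next; only the lock owner later walks from the head,
// filling in prev pointers, to find the FIFO successor at the tail.
struct Waiter {
  Waiter* next = nullptr;
  Waiter* prev = nullptr;
};

std::atomic<Waiter*> entry_list{nullptr};

// Lock-free push, done by contending threads: head is the only contended word.
void push(Waiter* node) {
  Waiter* head = entry_list.load();
  do {
    node->next = head;  // re-linked each retry against the freshly observed head
  } while (!entry_list.compare_exchange_weak(head, node));
}

// Only the monitor owner calls this, so the interior of the list is stable.
Waiter* find_tail() {
  Waiter* prev = nullptr;
  for (Waiter* w = entry_list.load(); w != nullptr; w = w->next) {
    w->prev = prev;     // form the doubly linked interior while walking
    prev = w;
  }
  return prev;          // the tail = the first thread that pushed = FIFO successor
}
```

The head is the only word that needs atomic updates; the prev links and tail are touched only under the monitor lock, which is what makes the single-list design simpler than the old `cxq` + `EntryList` pair.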
I agree that even though this seems performance neutral, the code simplification (for people reading it for the first time) will be worth it. Thanks. src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 331: > 329: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ > 330: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ > 331: volatile_nonstatic_field(ObjectMonitor, _entry_list, ObjectWaiter*) \ Suggestion: volatile_nonstatic_field(ObjectMonitor, _entry_list, ObjectWaiter*) \ Extra space src/hotspot/share/runtime/objectMonitor.cpp line 166: > 164: // its next pointer, and have its prev pointer set to null. Thus > 165: // pushing six threads A-F (in that order) onto entry_list, will > 166: // form a singly-linked list, see 1) below. Suggestion: have diagram 1 immediately follow this text so the reader doesn't have to jump down. src/hotspot/share/runtime/objectMonitor.cpp line 172: > 170: // from the entry_list head. While walking the list we also assign > 171: // the prev pointers of each thread, essentially forming a doubly > 172: // linked list, see 2) below. Suggestion: have diagram 2 immediately follow this text so the reader doesn't have to jump down. src/hotspot/share/runtime/objectMonitor.cpp line 176: > 174: // Once we have formed a doubly linked list it's easy to find the > 175: // successor, wake it up, have it remove itself, and update the > 176: // tail pointer, as seen in 2) and 3) below. Suggestion: // tail pointer, as seen in 3) below. But have diagram 3 right here. src/hotspot/share/runtime/objectMonitor.cpp line 179: > 177: // > 178: // At any time new threads can add themselves to the entry_list, see > 179: // 4) and 5). Diagrams 4 and 5 do not follow from what has just been described, but the use of "at any time" implies to me you intended to show them affecting the queue as we have already seen it. Again show the diagram you want here. 
src/hotspot/share/runtime/objectMonitor.cpp line 183: > 181: // If the thread that removes itself from the end of the list hasn't > 182: // got any prev pointer, we just set the tail pointer to null, see > 183: // 5) and 6). Suggestion: // If the thread to be removed is the only thread in the entry list: // entry_list -> A -> null // entry_list_tail ---^ // we remove it and just set the tail pointer to null, // entry_list -> null // entry_list_tail -> null src/hotspot/share/runtime/objectMonitor.cpp line 187: > 185: // Next time we need to find the successor and the tail is null, we > 186: // just start walking from the entry_list head again forming a new > 187: // doubly linked list, see 6) and 7) below. Suggestion: // Next time we need to find the successor and the tail is null, // entry_list ->I->H->G->null // entry_list_tail ->null // we just start walking from the entry_list head again forming a new // doubly linked list: // entry_list ->I<=>H<=>G->null // entry_list_tail ----------^ src/hotspot/share/runtime/objectMonitor.cpp line 189: > 187: // doubly linked list, see 6) and 7) below. > 188: // > 189: // 1) entry_list ->F->E->D->C->B->A->null Suggestion: // 1) entry_list ->F->E->D->C->B->A->null Right-justify the names please src/hotspot/share/runtime/objectMonitor.cpp line 215: > 213: // The mutex property of the monitor itself protects the entry_list > 214: // from concurrent interference. > 215: // -- Only the monitor owner may detach nodes from the entry_list. Suggestion for this block - get rid of invariants headings and just say: // The monitor itself protects all of the operations on the entry_list except for the CAS of a new arrival // to the head. Only the monitor owner can read or write the prev links (e.g. to remove itself) or update // the tail. src/hotspot/share/runtime/objectMonitor.cpp line 225: > 223: // concurrent detaching thread. This mechanism is immune from the > 224: // ABA corruption. 
More precisely, the CAS-based "push" onto > 225: // entry_list is ABA-oblivious. Not sure this actually says anything to help people understand the code or its operation. There basically is no A-B-A issue with the use of CAS here. src/hotspot/share/runtime/objectMonitor.cpp line 227: > 225: // entry_list is ABA-oblivious. > 226: // > 227: // * The entry_list form a queue of threads stalled trying to acquire Suggestion: // * The entry_list forms a queue of threads stalled trying to acquire src/hotspot/share/runtime/objectMonitor.cpp line 232: > 230: // thread notices that the tail of the entry_list is not known, we > 231: // convert the singly-linked entry_list into a doubly linked list by > 232: // assigning the prev pointers and the entry_list_tail pointer. Didn't we essentially say all this at the beginning? src/hotspot/share/runtime/objectMonitor.cpp line 260: > 258: // > 259: // * notify() or notifyAll() simply transfers threads from the WaitSet > 260: // to either the entry_list. Subsequent exit() operations will Suggestion: // to the entry_list. Subsequent exit() operations will src/hotspot/share/runtime/objectMonitor.cpp line 704: > 702: > 703: for (;;) { > 704: ObjectWaiter* front = Atomic::load(&_entry_list); In comments and code pick "head" or "front" to use to describe what _entry_list points to and use that consistently. I think "front" is much more common. src/hotspot/share/runtime/objectMonitor.cpp line 705: > 703: for (;;) { > 704: ObjectWaiter* front = Atomic::load(&_entry_list); > 705: No need for blank line. src/hotspot/share/runtime/objectMonitor.cpp line 718: > 716: // if we added current to _entry_list. Once on _entry_list, current > 717: // stays on-queue until it acquires the lock. > 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { Nit: the name suggests we do the try_lock first, when we don't. 
If we reverse the name we should also reverse the true/false return so that true relates to the first part of the name. See what others think. src/hotspot/share/runtime/objectMonitor.cpp line 719: > 717: // stays on-queue until it acquires the lock. > 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { > 719: node->_prev = nullptr; Shouldn't this already be the case? src/hotspot/share/runtime/objectMonitor.cpp line 724: > 722: for (;;) { > 723: ObjectWaiter* front = Atomic::load(&_entry_list); > 724: No need for blank line. src/hotspot/share/runtime/objectMonitor.cpp line 731: > 729: > 730: // Interference - the CAS failed because _entry_list changed. Just retry. > 731: // As an optional optimization we retry the lock. Suggestion: // Interference - the CAS failed because _entry_list changed. Before // retrying the CAS retry taking the lock as it may now be free. src/hotspot/share/runtime/objectMonitor.cpp line 812: > 810: guarantee(_entry_list == nullptr, > 811: "must be no entering threads: entry_list=" INTPTR_FORMAT, > 812: p2i(_entry_list)); Mustn't re-read _entry_list in the p2i as it may have changed from the value that is causing the guarantee to fail. The old guarantees were buggy in this regard - a temp is needed. src/hotspot/share/runtime/objectMonitor.cpp line 1299: > 1297: assert(_entry_list_tail == nullptr || _entry_list_tail == currentNode, "invariant"); > 1298: > 1299: ObjectWaiter* v = Atomic::load(&_entry_list); Nit: use `w` to be consistent with similar code. The original used `w` for EntryList and `v` for cxq IIRC. src/hotspot/share/runtime/objectMonitor.cpp line 2018: > 2016: // that in prepend-mode we invert the order of the waiters. Let's say that the > 2017: // waitset is "ABCD" and the entry_list is "XYZ". After a notifyAll() in prepend > 2018: // mode the waitset will be empty and the entry_list will be "DCBAXYZ". 
We don't support different ordering modes any more so we always "prepend" such that waiters are added to the entry_list in the reverse order of waiting. So given waitList -> A -> B -> C -> D, and _entry_list -> x -> y -> z we will get _entry_list -> D -> C -> B -> A -> X -> Y -> Z src/hotspot/share/runtime/objectMonitor.hpp line 195: > 193: volatile intx _recursions; // recursion count, 0 for first entry > 194: ObjectWaiter* volatile _entry_list; // Threads blocked on entry or reentry. > 195: // The list is actually composed of WaitNodes, Suggestion: // The list is actually composed of wait-nodes, Pre-existing (check for other uses) `WaitNodes` reads like a class name but it isn't. ------------- Changes requested by dholmes (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2643098063 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970923830 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970940771 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970940914 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970941662 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970936929 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970946641 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970948581 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970934947 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970956573 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970965071 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970965291 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970966451 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970967237 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970971522 PR Review Comment: 
https://git.openjdk.org/jdk/pull/23421#discussion_r1970968581 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970975419 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970976144 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970976457 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970977990 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970979335 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970982964 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1971037645 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970926134 From haosun at openjdk.org Wed Feb 26 08:30:55 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 26 Feb 2025 08:30:55 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Tue, 25 Feb 2025 19:45:31 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2097: > 2095: > 2096: // Half-precision floating-point instructions > 2097: INSN(fabdh, 0b011, 0b11, 0b000101, 0b0); I suppose `fabdh` and `fnmulh` are added to keep aligned with the float and double ones, i.e. `fabd(s|d)` and `fnmul(s|d)`. I noticed that there are matching rules for `fabd(s|d)`, i.e. `absd(F|D)_reg`. I wonder if we need to add the corresponding rule for fp16 here?
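For readers unfamiliar with the instruction under discussion: `fabd` computes the absolute difference of its two operands, which is why a matching rule for it would combine an absolute-value node with a subtraction node. A scalar sketch of the semantics (using `float` purely for illustration; the `fabdh` variant applies the same operation to half-precision values):

```cpp
#include <cassert>
#include <cmath>

// Scalar semantics of AArch64 fabd: absolute difference |a - b|.
// fabdh does the same on 16-bit floats; float is used here for clarity.
inline float fabd_semantics(float a, float b) {
  return std::fabs(a - b);
}
```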
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1971142347 From bkilambi at openjdk.org Wed Feb 26 08:52:53 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 26 Feb 2025 08:52:53 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Wed, 26 Feb 2025 08:26:57 GMT, Hao Sun wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Address review comments > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2097: > >> 2095: >> 2096: // Half-precision floating-point instructions >> 2097: INSN(fabdh, 0b011, 0b11, 0b000101, 0b0); > > I suppose `fadbh` and `fnmulh` are added to keep aligned with the float and double ones, i.e. `fabd(s|d)` and `fnmul(s|d)`. > > > I noticed that there are matching rules for `fabd(s|d)`, i.e. `absd(F|D)_reg`. I wonder if we need add the corresponding rule for fp16 here? Hi @shqking , thanks for your review comments. Yes I added `fabdh` and `fnmulh` to keep aligned with float and double types. For adding support for FP16 `absd` we need `AbsHF` to be supported (along with SubHF) but `AbsHF` node is not implemented currently. `abs` operation is directly executed from the java code here - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1464 and is not intrinsified or pattern matched like other FP16 operations. 
Same with `negate` operation for FP16 - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1449 On the Valhalla repo, while these operation were being developed, I tried adding support for `AbsHF/NegHF` which emitted `fabs` and `fneg` instructions but the performance with the direct java code(bit manipulation operations) was much faster (sorry don't remember the exact number) so we decided to go with the java implementation instead. I still added `fabd` here because `op21` is 0 only in `fabd` H variant and felt that it'd be better to handle it here as it belongs to this group of instructions. Please let me know your thoughts. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1971175829 From roland at openjdk.org Wed Feb 26 09:16:03 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Feb 2025 09:16:03 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> On Tue, 25 Feb 2025 09:27:13 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. 
I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 66 commits: > > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - stall -> delay, plus some more comments > - adjust selector if probability > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - remove multiversion mark if we break the structure > - register opaque with igvn > - copyright and rm CFG check > - IR rules for all cases > - 3 test versions > - test changed to unaligned ints > - ... and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684365921 From epeter at openjdk.org Wed Feb 26 10:02:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:02:09 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 09:12:46 GMT, Roland Westrelin wrote: > Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? 
In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. @rwestrel What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684482233 From roland at openjdk.org Wed Feb 26 10:18:12 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Feb 2025 10:18:12 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 09:59:36 GMT, Emanuel Peter wrote: > I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? 
Ok > And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if > > I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684523673 From aph at openjdk.org Wed Feb 26 10:27:02 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 26 Feb 2025 10:27:02 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Wed, 26 Feb 2025 08:49:58 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2097: >> >>> 2095: >>> 2096: // Half-precision floating-point instructions >>> 2097: INSN(fabdh, 0b011, 0b11, 0b000101, 0b0); >> >> I suppose `fadbh` and `fnmulh` are added to keep aligned with the float and double ones, i.e. `fabd(s|d)` and `fnmul(s|d)`. >> >> >> I noticed that there are matching rules for `fabd(s|d)`, i.e. `absd(F|D)_reg`. I wonder if we need add the corresponding rule for fp16 here? > > Hi @shqking , thanks for your review comments. Yes I added `fabdh` and `fnmulh` to keep aligned with float and double types. 
> For adding support for FP16 `absd` we need `AbsHF` to be supported (along with SubHF) but `AbsHF` node is not implemented currently. `abs` operation is directly executed from the java code here - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1464 and is not intrinsified or pattern matched like other FP16 operations. Same with `negate` operation for FP16 - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1449 > On the Valhalla repo, while these operation were being developed, I tried adding support for `AbsHF/NegHF` which emitted `fabs` and `fneg` instructions but the performance with the direct java code(bit manipulation operations) was much faster (sorry don't remember the exact number) so we decided to go with the java implementation instead. > I still added `fabd` here because `op21` is 0 only in `fabd` H variant and felt that it'd be better to handle it here as it belongs to this group of instructions. Please let me know your thoughts. According to the RM, fabd is in _Advanced SIMD scalar three same FP16_, but the rest are in _Floating-point data-processing (2 source)_. The decoding scheme looks rather different.`fabd`, then, doesn't really fit here, but in a section with the rest of the three same FP16 instructions. The encoding scheme for _Advanced SIMD scalar three same FP16_ is pretty simple, so I suggest you create a new group for them, and put `fabd` in there. 
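The point about encoding groups is that each group in the assembler shares one bit-field layout, so a single emitter routine can serve every member of the group. A sketch of the idea is below; the field positions and values are made up for illustration and do NOT match the real "Advanced SIMD scalar three same FP16" encoding, for which the Arm Architecture Reference Manual is the authority:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative packing of named bit-fields into a 32-bit instruction word.
// Field widths/positions here are invented for the sketch, not the real
// AArch64 layout: register fields are 5 bits wide, as on AArch64.
inline uint32_t pack_three_same(uint32_t group_bits, uint32_t rm,
                                uint32_t opcode, uint32_t rn, uint32_t rd) {
  return (group_bits << 21) | (rm << 16) | (opcode << 11) | (rn << 5) | rd;
}
```

With one such routine per group, adding a new member like `fabd` to the right group is a one-line table entry rather than a special case, which is the motivation for creating a separate group instead of forcing it into this one.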
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1971330062 From epeter at openjdk.org Wed Feb 26 10:30:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:30:15 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 10:15:48 GMT, Roland Westrelin wrote: > > And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if > > I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. > > I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. Ah ok, I'll have to look into it myself then. But if we know that it happens at the beginning of a loop-opts phase just after igvn, and no predicates were hacked yet, then that should work fine. 
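The fast/slow loop structure being discussed can be pictured in source-level terms as follows. This is only a caricature under assumed names (`inc_fast`, `inc_slow`, `kVecAlign`): the real transformation is performed on C2's IR, not in source code, and the runtime check corresponds to the `multiversion_if`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Slow version: makes no alignment assumption, never vectorized.
inline void inc_slow(int32_t* p, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) p[i] += 1;
}

// Fast version: the compiler may vectorize it, relying on the guard below.
inline void inc_fast(int32_t* p, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) p[i] += 1;
}

// The "multiversion_if": a runtime check on the base address selects the
// version that may assume alignment, or the fallback with no assumptions.
inline void inc(int32_t* p, std::size_t n) {
  const std::uintptr_t kVecAlign = 16;  // assumed vector alignment
  if (reinterpret_cast<std::uintptr_t>(p) % kVecAlign == 0) {
    inc_fast(p, n);   // fast loop
  } else {
    inc_slow(p, n);   // slow loop, still compiled and correct
  }
}
```

If later loop opts destroy the fast loop's expected shape, the slow branch and its guard become dead weight, which is what the proposed cleanup pass would detect and remove.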
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684550571 From epeter at openjdk.org Wed Feb 26 10:36:06 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:36:06 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 10:15:48 GMT, Roland Westrelin wrote: >>> Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. >> >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? >> >> I don't see it as super critical personally, as the slow_path is `delayed`, so no loop-opts are performed on it. The overhead is minimal if we keep it until after loop-opts, I think. But I'm not against trying. It would take a bit of effort to construct test cases where we have the loop fold away after multiversion_if is added, but that is probably possible. >> >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. 
So maybe we'd have to do that after the verification: >> [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. >> >> @rwestrel What do you think? > >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? > > Ok > >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. > > I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. @rwestrel I filed this follow-up RFE: [JDK-8350756](https://bugs.openjdk.org/browse/JDK-8350756): C2 SuperWord Multiversioning: remove useless slow loop when the fast loop disappears We'll have to be careful to only fold the `slow_loop` away if it is not used, i.e. if we did not in the meantime use the `multiversion_if`, and maybe the `fast_loop` structure is only disintegrating because of some speculative assumption, maybe because of more unrolling that only happens with vectorization. It would be good to have a test-case for that.
I'm writing that here so I will remember it later ;) @rwestrel Do you have any other ideas / suggestions? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684567780 From galder at openjdk.org Wed Feb 26 11:36:11 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 26 Feb 2025 11:36:11 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. 
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... 
and 34 more: https://git.openjdk.org/jdk/compare/abdd4f5e...a190ae68 > > Re: [#20098 (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) - I was trying to think what could be causing this. > > Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example? The probabilities are fine. I think the issue with `Math.min(II)` seems to be specific to when its compilation happens, and the combined fact that the intrinsic has been disabled and vectorization does not kick in (explicitly disabled). Note that other parts of the JDK invoke `Math.min(II)`. In the slow cases it appears the compilation happens before the benchmark kicks in, and so it takes the profiling data before the benchmark to decide how to compile this in. In the slow versions you see this `PrintMethodData`: static java.lang.Math::min(II)I interpreter_invocation_count: 18171 invocation_counter: 18171 backedge_counter: 0 decompile_count: 0 mdo size: 328 bytes 0 iload_0 1 iload_1 2 if_icmpgt 9 0 bci: 2 BranchData taken(7732) displacement(56) not taken(10180) 5 iload_0 6 goto 10 32 bci: 6 JumpData taken(10180) displacement(24) 9 iload_1 10 ireturn org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I interpreter_invocation_count: 189 invocation_counter: 189 backedge_counter: 313344 decompile_count: 0 mdo size: 384 bytes 0 iconst_0 1 istore_2 2 iconst_0 3 istore_3 4 iload_3 5 aload_1 6 fast_igetfield 35 9 if_icmpge 33 0 bci: 9 BranchData taken(58) displacement(72) not taken(192512) 12 aload_1 13 fast_agetfield 41 16 iload_3 17 iaload 18 istore #4 20 iload_2 21 fast_iload #4 23 invokestatic 32 32 bci: 23 CounterData count(192512) 26 istore_2 27 iinc #3 1 30 goto 4 48 bci: 30 JumpData taken(192512) displacement(-48) 33 iload_2 34 ireturn The benchmark method calls Math.min `192_512` times, yet the method data shows only `18_171` invocations, of which `7_732` 
are taken which is 42%. So it gets compiled with a `cmov` and the benchmark will be slow because it will branch 100% one of the sides. In the fast version, `PrintMethodData` looks like this: static java.lang.Math::min(II)I interpreter_invocation_count: 1575322 invocation_counter: 1575322 backedge_counter: 0 decompile_count: 0 mdo size: 368 bytes 0 iload_0 1 iload_1 2 if_icmpgt 9 0 bci: 2 BranchData taken(1418001) displacement(56) not taken(157062) 5 iload_0 6 goto 10 32 bci: 6 JumpData taken(157062) displacement(24) 9 iload_1 10 ireturn org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I interpreter_invocation_count: 858 invocation_counter: 858 backedge_counter: 1756214 decompile_count: 0 mdo size: 424 bytes 0 iconst_0 1 istore_2 2 iconst_0 3 istore_3 4 iload_3 5 aload_1 6 fast_igetfield 35 9 if_icmpge 33 0 bci: 9 BranchData taken(733) displacement(72) not taken(1637363) 12 aload_1 13 fast_agetfield 41 16 iload_3 17 iaload 18 istore #4 20 iload_2 21 fast_iload #4 23 invokestatic 32 32 bci: 23 CounterData count(1637363) 26 istore_2 27 iinc #3 1 30 goto 4 48 bci: 30 JumpData taken(1637363) displacement(-48) 33 iload_2 34 ireturn The benchmark method calls Math.min `1_637_363` times, and the method data shows `1_575_322` invocations, of which `1_418_001` are taken which is 90%. So no cmov is introduced and the benchmark will be fast because it will branch 100% one of the sides. A factor here might be my Xeon machine. I run the benchmark on a 4 core VM inside it, so given the limited resources compilation can take longer. I've noticed that it's easier to replicate this scenario there rather than my M1 laptop, which has 10 cores. >> So, if those int scalar regressions were not a problem when int min/max intrinsic was added, I would expect the same to apply to long. > > Do you know when they were added? If that was a long time ago, we might not have noticed back then, but we might notice now. 
I don't know when they were added. > That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability. > > What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here. I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2684701935 From duke at openjdk.org Wed Feb 26 14:18:14 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 26 Feb 2025 14:18:14 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: - Added more comments, mainly as suggested by Andrew Dinn - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23300/files - new: https://git.openjdk.org/jdk/pull/23300/files/54373d5a..aa0570db Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=05-06 Stats: 478 lines in 3 files changed: 40 ins; 6 del; 432 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From haosun at openjdk.org Wed Feb 26 14:41:56 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 26 Feb 2025 14:41:56 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Wed, 26 Feb 2025 08:49:58 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2097: >> >>> 2095: >>> 2096: // Half-precision floating-point instructions >>> 2097: INSN(fabdh, 0b011, 0b11, 0b000101, 0b0); >> >> I suppose `fadbh` and `fnmulh` are added to keep aligned with the float and double ones, i.e. `fabd(s|d)` and `fnmul(s|d)`. >> >> >> I noticed that there are matching rules for `fabd(s|d)`, i.e. `absd(F|D)_reg`. I wonder if we need add the corresponding rule for fp16 here? > > Hi @shqking , thanks for your review comments. Yes I added `fabdh` and `fnmulh` to keep aligned with float and double types. > For adding support for FP16 `absd` we need `AbsHF` to be supported (along with SubHF) but `AbsHF` node is not implemented currently. 
`abs` operation is directly executed from the java code here - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1464 and is not intrinsified or pattern matched like other FP16 operations. Same with the `negate` operation for FP16 - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1449
> On the Valhalla repo, while these operations were being developed, I tried adding support for `AbsHF/NegHF`, which emitted `fabs` and `fneg` instructions, but the performance with the direct java code (bit manipulation operations) was much faster (sorry, I don't remember the exact number), so we decided to go with the java implementation instead.
> I still added `fabd` here because `op21` is 0 only in the `fabd` H variant and I felt that it'd be better to handle it here as it belongs to this group of instructions. Please let me know your thoughts.

@Bhavana-Kilambi Thanks for your explanation of the missing `AbsHF`. It's okay to me to have `fabdh` and `fnmulh` in this patch.

Overall it's good to me, except for aph's comment above.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1971712164

From adinn at openjdk.org Wed Feb 26 14:58:07 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 26 Feb 2025 14:58:07 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: <8h5rWJFe3PKLNO6QiDZiAj98ePBoCilk0b9w420hZLE=.a17a4ecd-757b-405c-8f5a-5470bde5bf18@github.com> On Wed, 26 Feb 2025 14:18:14 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled.
> > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision:
> >
> > - Added more comments, mainly as suggested by Andrew Dinn
> > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi

Ok, still good

------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23300#pullrequestreview-2644812035

From galder at openjdk.org Wed Feb 26 18:33:03 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Wed, 26 Feb 2025 18:33:03 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Wed, 26 Feb 2025 11:32:57 GMT, Galder Zamarreño wrote:

> > That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability.
> > What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here.
>
> I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create.

Before noting the regressions, it's worth noting that this PR also improves performance in certain scenarios. I will summarise those tomorrow.

Here's a summary of the regressions:

### Regression 1

Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when Superword finds the pattern not profitable, then HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions:
a) make Superword recognise these scenarios as profitable.
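For concreteness, the reduction shape under discussion is roughly the following (a hedged sketch with made-up names, not the actual `MinMaxVector` benchmark source; with a strictly decreasing input, the `min` update is taken on every iteration, i.e. the branch is ~100% one-sided, which is the case where a scalar cmov loses against branching code):

```java
// Sketch of a long min reduction with a highly one-sided branch.
public class LongMinReduction {
    static long reduceMin(long[] a) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            // The Long.min pattern that C2 would intrinsify/vectorize.
            min = Math.min(min, a[i]);
        }
        return min;
    }

    public static void main(String[] args) {
        long[] a = new long[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = 1024 - i; // strictly decreasing: the "update min" side is always taken
        }
        System.out.println(reduceMin(a)); // prints 1
    }
}
```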
### Regression 2

Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when the platform does not support vector instructions to achieve this (e.g. AVX-512 quad word vpmax/vpmin), then HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions:
a) find a way to use other vector instructions (vpcmp+vpblend+vmov?)
b) fall back on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided

### Regression 3

Given a loop with a long min/max non-reduction pattern (e.g. `longLoopMax`) with one side of the branch taken near 100% of the time, when the platform does not vectorize it (either lack of CPU instruction support, or Superword finding it not profitable), then HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions:
a) find a way to use other vector instructions (e.g. `longLoopMax` vectorizes with AVX2 and might also do so with earlier instruction sets)
b) fall back on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided.

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2685865807

From roland at openjdk.org Wed Feb 26 19:34:04 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Feb 2025 19:34:04 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 09:27:13 GMT, Emanuel Peter wrote:

>> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>>
>> **Background**
>>
>> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned.
But with native memory, the `base` is just some arbitrarily aligned pointer.
>>
>> **Problem**
>>
>> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code.
>>
>> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
>> MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
>> test3(nativeUnaligned);
>>
>> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not!
>>
>> static void test3(MemorySegment ms) {
>>     for (int i = 0; i < RANGE; i++) {
>>         long adr = i * 4L;
>>         int v = ms.get(ELEMENT_LAYOUT, adr);
>>         ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
>>     }
>> }
>>
>> **Solution: Runtime Checks - Predicate and Multiversioning**
>>
>> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check.
>>
>> I came up with 2 options where to place the runtime checks:
>> - A new "auto vectorization" Parse Predicate:
>>   - This only works when predicates are available.
>>   - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
>> - Multiversion the loop:
>>   - Create 2 copies of the loop (fast and slow loops).
>>   - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take
>>   - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ...
>
> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits:
>
> - Merge branch 'master' into JDK-8323582-SW-native-alignment
> - stall -> delay, plus some more comments
> - adjust selector if probability
> - Merge branch 'master' into JDK-8323582-SW-native-alignment
> - remove multiversion mark if we break the structure
> - register opaque with igvn
> - copyright and rm CFG check
> - IR rules for all cases
> - 3 test versions
> - test changed to unaligned ints
> - ... and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292

Looks good to me.

------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2645658428

From epeter at openjdk.org Thu Feb 27 06:57:04 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Feb 2025 06:57:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> On Wed, 26 Feb 2025 18:29:58 GMT, Galder Zamarreño wrote:

>>> > Re: [#20098 (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) - I was trying to think what could be causing this.
>>>
>>> Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example?
>>
>> The probabilities are fine.
>> >> I think the issue with `Math.min(II)` seems to be specific to when its compilation happens, and the combined fact that the intrinsic has been disabled and vectorization does not kick in (explicitly disabled). Note that other parts of the JDK invoke `Math.min(II)`. >> >> In the slow cases it appears the compilation happens before the benchmark kicks in, and so it takes the profiling data before the benchmark to decide how to compile this in. >> >> In the slow versions you see this `PrintMethodData`: >> >> static java.lang.Math::min(II)I >> interpreter_invocation_count: 18171 >> invocation_counter: 18171 >> backedge_counter: 0 >> decompile_count: 0 >> mdo size: 328 bytes >> >> 0 iload_0 >> 1 iload_1 >> 2 if_icmpgt 9 >> 0 bci: 2 BranchData taken(7732) displacement(56) >> not taken(10180) >> 5 iload_0 >> 6 goto 10 >> 32 bci: 6 JumpData taken(10180) displacement(24) >> 9 iload_1 >> 10 ireturn >> >> org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I >> interpreter_invocation_count: 189 >> invocation_counter: 189 >> backedge_counter: 313344 >> decompile_count: 0 >> mdo size: 384 bytes >> >> 0 iconst_0 >> 1 istore_2 >> 2 iconst_0 >> 3 istore_3 >> 4 iload_3 >> 5 aload_1 >> 6 fast_igetfield 35 >> 9 if_icmpge 33 >> 0 bci: 9 BranchData taken(58) displacement(72) >> not taken(192512) >> 12 aload_1 >> 13 fast_agetfield 41 >> 16 iload_3 >> 17 iaload >> 18 istore #4 >> 20 iload_2 >> 21 fast_iload #4 >> 23 invokestatic 32 >> 32 bci: 23 CounterData count(192512) >> 26 istore_2 >> 27 iinc #3 1 >> 30 goto 4 >> 48 bci: 30 JumpData taken(192512) displacement(-48) >> 33 iload_2 >> 34 ireturn >> >> >> The benchmark method calls Math... > >> > That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability. 
>> > What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here. >> >> I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create. > > Before noting the regressions, it's worth noting that PR also improves performance certain scenarios. I will summarise those tomorrow. > > Here's a summary of the regressions > > ### Regression 1 > Given a loop with a long min/max reduction pattern with one side of branch taken near 100% of time, when Supeword finds the pattern not profitable, then HotSpot will use scalar instructions (cmov) and performance will regress. > > Possible solutions: > a) make Superword recognise these scenarios as profitable. > > ### Regression 2 > Given a loop with a long min/max reduction pattern with one side of branch near 100% of time, when the platform does not support vector instructions to achieve this (e.g. AVX-512 quad word vpmax/vpmin), then HotSpot will use scalar instructions (cmov) and performance will regress. > > Possible solutions > a) find a way to use other vector instructions (vpcmp+vpblend+vmov?) > b) fallback on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided > > ### Regression 3 > Given a loop with a long min/max non-reduction pattern (e.g. `longLoopMax`) with one side of branch taken near 100% of time, when the platform does not vectorize it (either lack of CPU instruction support, or Superword finding not profitable), then HotSpot will use scalar instructions (cmov) and performance will regress. > > Possible solutions: > a) find a way to use other vector instructions (e.g. `longLoopMax` vectorizes with AVX2 and might also do with earlier instruction sets) > b) fallback on more suitable scalar instructions, e.g. 
cmp+mov, when the branch is very one-sided.

@galderz Thanks for the summary of regressions!

Yes, there are plenty of speedups, I assume primarily because of `Long.min/max` vectorization, but possibly also because the operation can now "float" out of a loop, for example.

All your Regressions 1-3 are cases with "extreme" probability (close to 100% / 0%); you listed no others. That matches my intuition that branching code is usually better than cmove in extreme probability cases.

As for possible solutions: in all Regression 1-3 cases, it seems the issue is the scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFEs:

- Detect "extreme" probability scalar cmoves, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue.
- Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional.
- Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional.

Does that make sense, or am I missing something?

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2687067125

From epeter at openjdk.org Thu Feb 27 07:02:10 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Feb 2025 07:02:10 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 10:15:48 GMT, Roland Westrelin wrote: >>> Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards?
In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. >> >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? >> >> I don't see it as super critical personally, as the slow_path is `delayed`, so no loop-opts are performed on it. The overhead is minimal if we keep it until after loop-opts, I think. But I'm not against trying. It would take a bit of effort to construct test cases where we have the loop fold away after multiversion_if is added, but that is probably possible. >> >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: >> [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. >> >> @rwestrel What do you think? > >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? > > Ok > >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? 
What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. > > I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. @rwestrel @vnkozlov Thank you for the reviews, and all the good questions, and ideas for follow-up RFE's ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2687071561 From epeter at openjdk.org Thu Feb 27 07:02:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Feb 2025 07:02:11 GMT Subject: Integrated: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. 
I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... This pull request has now been integrated. 
Changeset: 885338b5 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/885338b5f38ed05d8b91efc0178b371f2f89310e Stats: 1089 lines in 27 files changed: 966 ins; 28 del; 95 mod 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory Reviewed-by: roland, kvn ------------- PR: https://git.openjdk.org/jdk/pull/22016 From adinn at openjdk.org Thu Feb 27 09:50:59 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 27 Feb 2025 09:50:59 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 10:23:37 GMT, Ferenc Rakoczi wrote: >> Hi. Here is the test result of our CI. >> >> ### copyright year >> >> the following files should update the copyright year to 2025. >> >> >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp >> src/hotspot/share/runtime/globals.hpp >> src/java.base/share/classes/sun/security/provider/ML_DSA.java >> src/java.base/share/classes/sun/security/provider/SHA3Parallel.java >> test/micro/org/openjdk/bench/java/security/MLDSA.java >> >> >> ### cross-build failure >> >> Cross build for riscv64/s390/ppc64 failed. 
>>
>> Here is the error message for ppc64:
>>
>> === Output from failing command(s) repeated here ===
>> * For target support_interim-jmods_support__create_java.base.jmod_exec:
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769
>> # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0
>> #
>> # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3)
>> # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
>> # Problematic frame:
>> # V [libjvm.so+0x3b391c] Instruction_aarch64::~Instruction_aarch64()+0xbc
>> #
>> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/
>> #
>> # An error report file with more information is saved as:
>> # /tmp/jdk-src/make/hs_err_pid72752.log
>> ... (rest of output omitted)
>>
>> * All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs.
>> === End of repeated output ===
>>
>> I suppose we should make a similar update to the file `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` for the other platforms.
>
> @shqking, I changed the copyright years, but I don't really understand how the aarch64-specific code can overflow buffers on other architectures. As far as I understand, Instruction_aarch64 should not have been there in a ppc build.
> Was this a build attempted on an aarch64 for the other architectures?

@ferakocz Apologies for raising yet another resolve conflict. You will need to make a further adjustment to the compiler blob declaration to accommodate a fix I just pushed to resolve a problem with cross-compilation.
Your patch should now specify do_arch_blob(compiler, 50000 ZGC_ONLY(+10000)) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2687427983 From adinn at openjdk.org Thu Feb 27 09:56:02 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 27 Feb 2025 09:56:02 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 14:18:14 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Added more comments, mainly as suggested by Andrew Dinn > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi Oops. sorry - cut and paste error -- the new setting should be do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2687440017 From aph at openjdk.org Thu Feb 27 10:19:06 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 27 Feb 2025 10:19:06 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: <_CekdxBJviS_sZCVN62_yFx-cTF4qrIuAnqbIeUmFck=.3a6afffb-8fbe-4809-a4ca-1bc22b52a628@github.com> References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> <_CekdxBJviS_sZCVN62_yFx-cTF4qrIuAnq bIeUmFck=.3a6afffb-8fbe-4809-a4ca-1bc22b52a628@github.com> Message-ID: On Tue, 25 Feb 2025 15:58:18 GMT, Ferenc Rakoczi wrote: >> Aha! 
aph at Andrews-MacBook-Pro ~ % as t.s
t.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
sub x1, x10, x23, sxth #2
                  ^
aph at Andrews-MacBook-Pro ~ % as --version
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.3.0

> OK, so GNU as is more forgiving than Apple as...

Did my patch to aarch64-asmtest.py solve the problem?

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1973284472

From coleenp at openjdk.org Thu Feb 27 14:28:07 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 27 Feb 2025 14:28:07 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 05:42:12 GMT, David Holmes wrote:

>> I've combined two of `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list: the `entry_list`.
>>
>> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past.
>>
>> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of the `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks.
>>
>> The new list-design is as much a multi-queue as the current one. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`.
>>
>> You always add to the `entry_list` by Compare And Exchange to the head.
The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable.
>>
>> The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list.
>>
>> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor).
>>
>> Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation.
>>
>> However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac...
>
> src/hotspot/share/runtime/objectMonitor.cpp line 166:
>
>> 164: // its next pointer, and have its prev pointer set to null. Thus
>> 165: // pushing six threads A-F (in that order) onto entry_list, will
>> 166: // form a singly-linked list, see 1) below.
>
> Suggestion: have diagram 1 immediately follow this text so the reader doesn't have to jump down.

I like this suggestion. I like these comments.

> src/hotspot/share/runtime/objectMonitor.cpp line 718:
>
>> 716: // if we added current to _entry_list. Once on _entry_list, current
>> 717: // stays on-queue until it acquires the lock.
>> 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { > > Nit: the name suggests we do the try_lock first, when we don't. If we reverse the name we should also reverse the true/false return so that true relates to the first part of the name. See what others think. How about add_to_entry_list with a boolean parameter that tries the lock if it fails, and only have one of these functions? Although the return true if you get the lock makes it weird. bool add_to_entry_list(JavaThread* current, ObjectWaiter* node, bool or_lock) { return true if locked, false otherwise; } Maybe that makes sense. > src/hotspot/share/runtime/objectMonitor.cpp line 719: > >> 717: // stays on-queue until it acquires the lock. >> 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { >> 719: node->_prev = nullptr; > > Shouldn't this already be the case? I think for the vthread case, it isn't yet(?). Maybe motivation to fix the ObjectWaiter constructor with this patch? > src/hotspot/share/runtime/objectMonitor.cpp line 2018: > >> 2016: // that in prepend-mode we invert the order of the waiters. Let's say that the >> 2017: // waitset is "ABCD" and the entry_list is "XYZ". After a notifyAll() in prepend >> 2018: // mode the waitset will be empty and the entry_list will be "DCBAXYZ". > > We don't support different ordering modes any more so we always "prepend" such that waiters are added to the entry_list in the reverse order of waiting. So given waitList -> A -> B -> C -> D, and _entry_list -> x -> y -> z we will get _entry_list -> D -> C -> B -> A -> X -> Y -> Z One of the benefits of this work is to read, understand and clean up misleading and out of date comments in this code. 
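For readers skimming the thread, the scheme restated in the PR description above (lock-free CAS push to the entry_list head, FIFO removal from a lazily discovered tail, prev pointers assigned during a single walk) can be sketched in a few dozen lines. This is an illustrative toy under simplifying assumptions, not the HotSpot code: the `Node`/`EntryList` names and the `push`/`successor`/`pop_tail` split are invented here, and only a single "owner" thread may call `successor()`/`pop_tail()` while any thread may `push()`.

```cpp
#include <atomic>
#include <cassert>

// Toy model of the combined entry_list: lock-free CAS push to the head,
// FIFO removal from a lazily discovered tail. Only the "monitor owner"
// may call successor()/pop_tail(); any thread may call push().
struct Node {
  Node* next = nullptr;
  Node* prev = nullptr;
  int id;
  explicit Node(int i) : id(i) {}
};

struct EntryList {
  std::atomic<Node*> head{nullptr};
  Node* tail = nullptr;  // cached tail, touched only by the owner

  void push(Node* n) {  // corresponds to the CAS "push" onto the head
    n->prev = nullptr;
    Node* h = head.load(std::memory_order_relaxed);
    do {
      n->next = h;
    } while (!head.compare_exchange_weak(
        h, n, std::memory_order_release, std::memory_order_relaxed));
  }

  Node* successor() {  // owner-only: find the FIFO successor (the tail)
    if (tail == nullptr) {
      Node* n = head.load(std::memory_order_acquire);
      if (n == nullptr) return nullptr;
      while (n->next != nullptr) {  // one walk, assigning prev links
        n->next->prev = n;
        n = n->next;
      }
      tail = n;
    }
    return tail;
  }

  Node* pop_tail() {  // owner-only: unlink the successor
    Node* t = successor();
    if (t == nullptr) return nullptr;
    if (t->prev != nullptr) {  // interior unlink via the prev link
      t->prev->next = nullptr;
      tail = t->prev;
      return t;
    }
    Node* expected = t;  // t may be the only node: try to empty the list
    if (head.compare_exchange_strong(expected, nullptr)) {
      tail = nullptr;
      return t;
    }
    tail = nullptr;  // new arrivals raced onto the head: re-walk,
    successor();     // which assigns t->prev, then unlink as before
    t->prev->next = nullptr;
    tail = t->prev;
    return t;
  }
};
```

The interesting case is the last branch of `pop_tail()`: the successor has no prev link yet and is no longer alone because new waiters were pushed concurrently, so the owner drops the cached tail and simply re-walks from the head, mirroring the "start walking from the entry_list head again" recovery described in the thread.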
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973636957 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973657207 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973681891 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973684370 From coleenp at openjdk.org Thu Feb 27 14:28:05 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 27 Feb 2025 14:28:05 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable.
> > The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. > > However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... This looks really good - I have some small change and improvement requests. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 418: > 416: // have released the lock. > 417: // Refer to the comments in synchronizer.cpp for how we might encode extra > 418: // state in _succ so we can avoid fetching entry_list. There is no comment in synchronizer about this (that I can find), and it's not clear this is a good idea, so can you remove this line with this change? src/hotspot/share/runtime/objectMonitor.cpp line 701: > 699: void ObjectMonitor::add_to_entry_list(JavaThread* current, ObjectWaiter* node) { > 700: node->_prev = nullptr; > 701: node->TState = ObjectWaiter::TS_ENTER; I think you should do this in a future cleanup.
The ObjectWaiter's constructor should initialize these fields to TS_ENTER or TS_WAIT when it's created and make prev, next null (or 0xBAD?). And fix the constructor to have an initialization list instead. src/hotspot/share/runtime/objectMonitor.cpp line 735: > 733: assert(!has_successor(current), "invariant"); > 734: assert(has_owner(current), "invariant"); > 735: return true; I wonder for a future RFE we can move these asserts into TryLock. src/hotspot/share/runtime/objectMonitor.cpp line 1285: > 1283: // By convention we unlink a contending thread from _entry_list immediately > 1284: // after the thread acquires the lock in ::enter(). Equally, we could defer > 1285: // unlinking the thread until ::exit()-time. Since you're here, remove these two lines 1222-1223. I really don't think pointing out an alternate implementation that we did not choose is helpful to understanding this code. src/hotspot/share/runtime/objectMonitor.hpp line 46: > 44: class ObjectWaiter : public CHeapObj { > 45: public: > 46: enum TStates : uint8_t { TS_UNDEF, TS_READY, TS_RUN, TS_WAIT, TS_ENTER }; TS_READY looks unused. src/hotspot/share/runtime/objectMonitor.hpp line 79: > 77: void set_bad_pointers() { > 78: #ifdef ASSERT > 79: // Diagnostic hygiene ... hygiene seems like the wrong word here. Can you remove this comment? src/hotspot/share/runtime/synchronizer.cpp line 369: > 367: // We have one or more waiters. Since this is an inflated monitor > 368: // that we own, we can transfer one or more threads from the waitset > 369: // to the entry_list here and now, avoiding the slow-path. Not related to this change but I found that this quick_notify isn't quicker. ------------- Changes requested by coleenp (Reviewer). 
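A minimal sketch of the constructor cleanup suggested above: move field initialization into an initialization list so no caller has to null out the links before enqueueing. The class and field names here are simplified stand-ins of our own, not the real HotSpot `ObjectWaiter`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical waiter node with the suggested initialization list:
// links start as null and the state is passed in explicitly.
enum TStates : uint8_t { TS_RUN, TS_WAIT, TS_ENTER };

class WaiterNode {
 public:
  explicit WaiterNode(TStates initial_state)
      : _next(nullptr), _prev(nullptr), _state(initial_state) {}
  WaiterNode* _next;
  WaiterNode* _prev;
  TStates _state;
};
```

With this shape, the `node->_prev = nullptr; node->TState = ...` lines quoted in the review would become redundant at every enqueue site.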
PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2647862248 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973630782 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973654464 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973664035 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973670396 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973678657 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973632087 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973634214 From fbredberg at openjdk.org Thu Feb 27 15:54:28 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 15:54:28 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current.
Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. > > However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: Update after review by David and Coleen.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23421/files - new: https://git.openjdk.org/jdk/pull/23421/files/e1d4fac6..283c2431 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=00-01 Stats: 124 lines in 5 files changed: 28 ins; 36 del; 60 mod Patch: https://git.openjdk.org/jdk/pull/23421.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23421/head:pull/23421 PR: https://git.openjdk.org/jdk/pull/23421 From fbredberg at openjdk.org Thu Feb 27 16:00:13 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 16:00:13 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 05:42:25 GMT, David Holmes wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 172: > >> 170: // from the entry_list head. While walking the list we also assign >> 171: // the prev pointers of each thread, essentially forming a doubly >> 172: // linked list, see 2) below. > > Suggestion: have diagram 2 immediately follow this text so the reader doesn't have to jump down. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973880640 From fbredberg at openjdk.org Thu Feb 27 16:00:14 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 16:00:14 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 14:09:45 GMT, Coleen Phillimore wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. 
> > src/hotspot/share/runtime/objectMonitor.cpp line 701: > >> 699: void ObjectMonitor::add_to_entry_list(JavaThread* current, ObjectWaiter* node) { >> 700: node->_prev = nullptr; >> 701: node->TState = ObjectWaiter::TS_ENTER; > > I think you should do this in a future cleanup. The ObjectWaiter's constructor should initialize these fields to TS_ENTER or TS_WAIT when it's created and make prev, next null (or 0xBAD?). And fix the constructor to have an initialization list instead. Sounds like a plan. > src/hotspot/share/runtime/synchronizer.cpp line 369: > >> 367: // We have one or more waiters. Since this is an inflated monitor >> 368: // that we own, we can transfer one or more threads from the waitset >> 369: // to the entry_list here and now, avoiding the slow-path. > > Not related to this change but I found that this quick_notify isn't quicker. Let's make quick_notify quicker (in another RFE). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973883699 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973878764 From galder at openjdk.org Thu Feb 27 16:41:13 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 27 Feb 2025 16:41:13 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g.
>> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase.
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... and 34 more: https://git.openjdk.org/jdk/compare/92e82467...a190ae68 Also, I've started a [discussion on jmh-dev](https://mail.openjdk.org/pipermail/jmh-dev/2025-February/004094.html) to see if there's a way to minimise pollution of `Math.min(II)` compilation. As a follow to https://github.com/openjdk/jdk/pull/20098#issuecomment-2684701935 I looked at where the other `Math.min(II)` calls are coming from, and a big chunk seem related to the JMH infrastructure. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2688510211 From galder at openjdk.org Thu Feb 27 16:38:04 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 27 Feb 2025 16:38:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> Message-ID: On Thu, 27 Feb 2025 06:54:30 GMT, Emanuel Peter wrote: > Detect "extreme" probability scalar cmove, and replace them with branching code. 
This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the Integer.min/max cases, which have the same issue. +1 and the rest of the suggestions. Shall I create a JDK bug for this? > Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional. > Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional. Do we need JDK bug(s) for these? If so, how many? 1 or 2? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2688502397 From duke at openjdk.org Thu Feb 27 16:48:07 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 27 Feb 2025 16:48:07 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> <_CekdxBJviS_sZCVN62_yFx-cTF4qrIuAnqbIeUmFck=.3a6afffb-8fbe-4809-a4ca-1bc22b52a628@github.com> Message-ID: On Thu, 27 Feb 2025 10:15:48 GMT, Andrew Haley wrote: >> OK, so GNU as is more forgiving than Apple as... > > Did my patch to aarch64-asmtest.py solve the problem?
> > src/hotspot/share/runtime/objectMonitor.hpp line 46: > >> 44: class ObjectWaiter : public CHeapObj { >> 45: public: >> 46: enum TStates : uint8_t { TS_UNDEF, TS_READY, TS_RUN, TS_WAIT, TS_ENTER }; > > TS_READY looks unused. Edit: this could be a trivial further PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974015687 From coleenp at openjdk.org Thu Feb 27 17:16:04 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 27 Feb 2025 17:16:04 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order).
The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. >> >> However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. This change looks great. Thank you! src/hotspot/share/runtime/objectMonitor.cpp line 219: > 217: // entry_list_tail ----------^ > 218: // > 219: // * The monitor itself protects all of the operations on the This is a nice comment and really helps understand the algorithm. src/hotspot/share/runtime/objectMonitor.cpp line 948: > 946: current->_ParkEvent->reset(); > 947: > 948: if (try_lock_or_add_to_entry_list(current, &node)) { try_lock_or_add_to_entry_list() name makes sense in this context. if (add_to_entry_list(current, &node, /*try_lock*/true)) { return; // We got the lock } Makes less sense.
I propose leaving the names and the functions for now. ------------- Marked as reviewed by coleenp (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2648493876 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974006126 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974014351 From ayang at openjdk.org Thu Feb 27 18:34:14 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Thu, 27 Feb 2025 18:34:14 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: References: Message-ID: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> On Tue, 25 Feb 2025 15:13:43 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * remove unnecessarily added logging src/hotspot/share/gc/g1/g1BarrierSet.hpp line 54: > 52: // them, keeping the write barrier simple. > 53: // > 54: // The refinement threads mark cards in the the current collection set specially on the "the the" typo. 
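The filtering sequence in the barrier pseudo-code quoted above can be made concrete. The sketch below is a toy of our own making, with invented card/region sizes and a plain counter standing in for the dirty card queue; it illustrates only the order of the filters, not actual G1 behaviour (the real barrier also needs a StoreLoad fence and enqueues the card address for refinement).

```cpp
#include <cassert>
#include <cstdint>

// Toy post-write barrier for "x.a = y". Addresses are plain byte
// offsets into a pretend heap; all sizes here are illustrative only.
enum CardValue : uint8_t { CLEAN = 0, DIRTY = 1, YOUNG = 2 };

constexpr int kCardShift = 9;     // 512-byte cards
constexpr int kRegionShift = 20;  // 1 MiB regions

static uint8_t g_cards[1 << 12] = {};  // covers a 2 MiB toy heap
static int g_enqueued = 0;             // stands in for the dirty card queue

void post_write_barrier(uintptr_t field_addr, uintptr_t new_value) {
  // Filtering, in the order the quoted pseudo-code lists it:
  if ((field_addr >> kRegionShift) == (new_value >> kRegionShift))
    return;                                  // same-region check
  if (new_value == 0) return;                // null value check
  uint8_t* card = &g_cards[field_addr >> kCardShift];
  if (*card == YOUNG) return;                // write-to-young-gen check
  // <-- the StoreLoad fence would sit here in the real barrier
  if (*card == DIRTY) return;                // already dirty
  *card = DIRTY;
  // Card tracking: real G1 enqueues the card address into a DCQ here
  g_enqueued++;
}
```

Walking a few stores through this sketch makes it clear why the filters matter: repeated stores to the same card, null stores, and intra-region stores all fall out before the expensive dirty-and-enqueue tail is reached.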
src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 47: > 45: > 46: // Returns bits from a where mask is 0, and bits from b where mask is 1. > 47: inline size_t blend(size_t a, size_t b, size_t mask) { Can you provide some input/output examples in the doc? src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 45: > 43: } > 44: > 45: void G1CardTableClaimTable::initialize(size_t max_reserved_regions) { Should the arg be `uint`? src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 280: > 278: assert_state(State::SweepRT); > 279: > 280: set_state_start_time(); This method is called in a loop; would that skew the state-starting time? src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 344: > 342: size_t _num_clean; > 343: size_t _num_dirty; > 344: size_t _num_to_cset; Seem never read. src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349: > 347: > 348: bool do_heap_region(G1HeapRegion* r) override { > 349: if (!r->is_free()) { I am a bit lost on this closure; the intention seems to set unclaimed to all non-free regions, why can't this be done in one go, instead of first setting all regions to claimed (`reset_all_claims_to_claimed`), then set non-free ones unclaimed? src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: > 114: > 115: // Current heap snapshot. > 116: G1CardTableClaimTable* _sweep_state; Since this is a table, I wonder if we can name it "x_table" instead of "x_state". src/hotspot/share/gc/g1/g1RemSet.cpp line 147: > 145: if (_contains[region]) { > 146: return; > 147: } Indentation seems broken. src/hotspot/share/gc/g1/g1RemSet.cpp line 830: > 828: size_t const start_idx = region_card_base_idx + claim.value(); > 829: > 830: size_t* card_cur_card = (size_t*)card_table->byte_for_index(start_idx); This var name should end with "_word", instead of "_card". 
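On the request above for input/output examples on `blend`: the documented contract ("bits from a where mask is 0, and bits from b where mask is 1") is the classic branch-free bit blend. The sketch below is our own illustration of that contract with a worked example, not necessarily the exact code in the patch.

```cpp
#include <cassert>
#include <cstddef>

// Returns bits from a where mask is 0, and bits from b where mask is 1.
// Equivalent to (a & ~mask) | (b & mask), written with one fewer
// operation as a ^ ((a ^ b) & mask).
inline size_t blend(size_t a, size_t b, size_t mask) {
  return a ^ ((a ^ b) & mask);
}

// Worked example: a = 0x00FF, b = 0xFF00, mask = 0x0F0F
//   a & ~mask = 0x00F0  (kept from a)
//   b &  mask = 0x0F00  (taken from b)
//   result    = 0x0FF0
```

The two boundary cases document the contract nicely: `mask == 0` returns `a` unchanged, and an all-ones mask returns `b` unchanged.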
src/hotspot/share/gc/g1/g1RemSet.cpp line 1252: > 1250: G1ConcurrentRefineWorkState::snapshot_heap_into(&constructed); > 1251: claim = &constructed; > 1252: } It's not super obvious to me why the "has_sweep_claims" checking needs to be on this level. Can `G1ConcurrentRefineWorkState` return a valid `G1CardTableClaimTable*` directly? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974124792 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1971426039 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973435950 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974083760 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973447654 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973452168 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974056492 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973423400 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974108760 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974134441 From fbredberg at openjdk.org Thu Feb 27 19:57:03 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 19:57:03 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 05:19:44 GMT, David Holmes wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. 
> > src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 331: > >> 329: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ >> 330: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ >> 331: volatile_nonstatic_field(ObjectMonitor, _entry_list, ObjectWaiter*) \ > > Suggestion: > > volatile_nonstatic_field(ObjectMonitor, _entry_list, ObjectWaiter*) \ > > Extra space Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 176: > >> 174: // Once we have formed a doubly linked list it's easy to find the >> 175: // successor, wake it up, have it remove itself, and update the >> 176: // tail pointer, as seen in 2) and 3) below. > > Suggestion: > > // tail pointer, as seen in 3) below. > > But have diagram 3 right here. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 179: > >> 177: // >> 178: // At any time new threads can add themselves to the entry_list, see >> 179: // 4) and 5). > > Diagrams 4 and 5 do not follow from what has just been described, but the use of "at any time" implies to me you intended to show them affecting the queue as we have already seen it. > > Again show the diagram you want here. Rewrote diagram. > src/hotspot/share/runtime/objectMonitor.cpp line 183: > >> 181: // If the thread that removes itself from the end of the list hasn't >> 182: // got any prev pointer, we just set the tail pointer to null, see >> 183: // 5) and 6). > > Suggestion: > > // If the thread to be removed is the only thread in the entry list: > // entry_list -> A -> null > // entry_list_tail ---^ > // we remove it and just set the tail pointer to null, > // entry_list -> null > // entry_list_tail -> null Rewrote the diagram. Wanted to show how things work when the thread that removes itself from the end of the list hasn't got any prev pointer (and it's not the only thread in the entry list).
> src/hotspot/share/runtime/objectMonitor.cpp line 187: > >> 185: // Next time we need to find the successor and the tail is null, we >> 186: // just start walking from the entry_list head again forming a new >> 187: // doubly linked list, see 6) and 7) below. > > Suggestion: > > // Next time we need to find the successor and the tail is null, > // entry_list ->I->H->G->null > // entry_list_tail ->null > // we just start walking from the entry_list head again forming a new > // doubly linked list: > // entry_list ->I<=>H<=>G->null > // entry_list_tail ----------^ Rewrote diagram. Didn't abandon the "number list" since everything else is written that way. > src/hotspot/share/runtime/objectMonitor.cpp line 189: > >> 187: // doubly linked list, see 6) and 7) below. >> 188: // >> 189: // 1) entry_list ->F->E->D->C->B->A->null > > Suggestion: > > // 1) entry_list ->F->E->D->C->B->A->null > > Right-justify the names please. I think it's more readable to have it left-justified, since entry_list and entry_list_tail both start with the same text. > src/hotspot/share/runtime/objectMonitor.cpp line 215: > >> 213: // The mutex property of the monitor itself protects the entry_list >> 214: // from concurrent interference. >> 215: // -- Only the monitor owner may detach nodes from the entry_list. > > Suggestion for this block - get rid of invariants headings and just say: > > // The monitor itself protects all of the operations on the entry_list except for the CAS of a new arrival > // to the head. Only the monitor owner can read or write the prev links (e.g. to remove itself) or update > // the tail. Fixed 
There basically is no A-B-A issue with the use of CAS here. Rewrote the comment. > src/hotspot/share/runtime/objectMonitor.cpp line 227: > >> 225: // entry_list is ABA-oblivious. >> 226: // >> 227: // * The entry_list form a queue of threads stalled trying to acquire > > Suggestion: > > // * The entry_list forms a queue of threads stalled trying to acquire Fixed > src/hotspot/share/runtime/objectMonitor.hpp line 195: > >> 193: volatile intx _recursions; // recursion count, 0 for first entry >> 194: ObjectWaiter* volatile _entry_list; // Threads blocked on entry or reentry. >> 195: // The list is actually composed of WaitNodes, > > Suggestion: > > // The list is actually composed of wait-nodes, > > Pre-existing (check for other uses) `WaitNodes` reads like a class name but it isn't. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974244653 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974247893 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974246933 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974250054 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974251792 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974246012 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974252355 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974252954 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974253676 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974245155 From fbredberg at openjdk.org Thu Feb 27 19:57:04 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 19:57:04 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 13:59:38 GMT, Coleen Phillimore wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 166: >> >>> 164: // its 
next pointer, and have its prev pointer set to null. Thus >>> 165: // pushing six threads A-F (in that order) onto entry_list, will >>> 166: // form a singly-linked list, see 1) below. >> >> Suggestion: have diagram 1 immediately follow this text so the reader doesn't have to jump down. > > I like this suggestion. I like these comments. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974247465 From fbredberg at openjdk.org Thu Feb 27 20:04:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:04:02 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 06:08:14 GMT, David Holmes wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 232: > >> 230: // thread notices that the tail of the entry_list is not known, we >> 231: // convert the singly-linked entry_list into a doubly linked list by >> 232: // assigning the prev pointers and the entry_list_tail pointer. > > Didn't we essentially say all this at the beginning? This text makes more sense before the newly added "Example:", so I moved it. > src/hotspot/share/runtime/objectMonitor.cpp line 260: > >> 258: // >> 259: // * notify() or notifyAll() simply transfers threads from the WaitSet >> 260: // to either the entry_list. Subsequent exit() operations will > > Suggestion: > > // to the entry_list. Subsequent exit() operations will Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 704: > >> 702: >> 703: for (;;) { >> 704: ObjectWaiter* front = Atomic::load(&_entry_list); > > In comments and code pick "head" or "front" to use to describe what _entry_list points to and use that consistently. I think "front" is much more common. A `grep -r` suggests that `head` is more common, so I changed to `head`. 
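The push described above — new arrivals CAS themselves onto the head of `_entry_list`, next pointing at the old head and prev left null — can be sketched as a small standalone model. This is illustrative only (it uses `std::atomic` rather than HotSpot's `Atomic::` wrappers, and `Node`/`push_to_entry_list` are invented names):

```cpp
#include <atomic>
#include <cassert>

// Minimal model of the lock-free "push to head". The prev pointer stays
// null here; it is only assigned later, by the monitor owner, when the
// list is walked to find the tail.
struct Node {
  Node* next = nullptr;
  Node* prev = nullptr;
};

std::atomic<Node*> entry_list_head{nullptr};

void push_to_entry_list(Node* n) {
  n->prev = nullptr;
  Node* old_head = entry_list_head.load();
  do {
    n->next = old_head;  // link to the head we observed
    // The CAS fails (and reloads old_head) if another thread got in first.
  } while (!entry_list_head.compare_exchange_weak(old_head, n));
}
```

Pushing A, then B, then C leaves the head pointing at C with `C->next == B` and `B->next == A` — the reverse arrival order shown in diagram 1), where pushing A-F yields `entry_list ->F->E->D->C->B->A->null`.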
> src/hotspot/share/runtime/objectMonitor.cpp line 705: > >> 703: for (;;) { >> 704: ObjectWaiter* front = Atomic::load(&_entry_list); >> 705: > > No need for blank line. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974257620 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974259984 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974261995 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974260402 From fbredberg at openjdk.org Thu Feb 27 20:12:58 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:12:58 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 13:56:15 GMT, Coleen Phillimore wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 418: > >> 416: // have released the lock. >> 417: // Refer to the comments in synchronizer.cpp for how we might encode extra >> 418: // state in _succ so we can avoid fetching entry_list. > > There is no comment in synchronizer about this (that I can find), and whether or not this is a good idea, can you remove this line with this change? Removed > src/hotspot/share/runtime/objectMonitor.hpp line 79: > >> 77: void set_bad_pointers() { >> 78: #ifdef ASSERT >> 79: // Diagnostic hygiene ... > > hygiene seems like the wrong word here. Can you remove this comment? 
Removed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974271052 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974271724 From fbredberg at openjdk.org Thu Feb 27 20:12:59 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:12:59 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: <0ALa3fouoHHnr9xwosMUd0gxQnQFwomxSmQ8_4wijcY=.acdb876b-6b94-4320-904a-f7741d54c8de@github.com> On Thu, 27 Feb 2025 14:11:21 GMT, Coleen Phillimore wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 718: >> >>> 716: // if we added current to _entry_list. Once on _entry_list, current >>> 717: // stays on-queue until it acquires the lock. >>> 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { >> >> Nit: the name suggests we do the try_lock first, when we don't. If we reverse the name we should also reverse the true/false return so that true relates to the first part of the name. See what others think. > > How about add_to_entry_list with a boolean parameter that tries the lock if it fails, and only have one of these functions? Although the return true if you get the lock makes it weird. > > > bool add_to_entry_list(JavaThread* current, ObjectWaiter* node, bool or_lock) { > return true if locked, false otherwise; > } > > > Maybe that makes sense. I wasn't completely happy with naming this `try_lock_or_add_to_entry_list` for the exact reason David points out. It does NOT first `try_lock` and then if that fails `add_to_entry_list`. It does the complete opposite. It first tries to add to the entry list and if that fails, it tries to lock. So why on earth did I end up with this solution? Because I went along with how the current family of `try_enter`, `spin_enter` and `TryLockWithContentionMark` works. They all try to lock the monitor and if they succeed they return true, otherwise they return false. 
And this is exactly how my `try_lock_or_add_to_entry_list` works, except for the fact that when it returns false (because we didn't get the lock) the current thread has been added to the `entry_list`. I also think that combining the two functions into one (as Coleen suggests) just adds to the confusion, mostly because of the "weird" return value. I guess we just have to choose what kind of weirdness we can accept. I'm absolutely willing to change it if anyone has a strong opinion, or comes up with something that the majority think is better. For me joining the `TryLockWithContentionMark` etc. camp seemed like the most reasonable kind of weird. >> src/hotspot/share/runtime/objectMonitor.cpp line 719: >> >>> 717: // stays on-queue until it acquires the lock. >>> 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { >>> 719: node->_prev = nullptr; >> >> Shouldn't this already be the case? > > I think for the vthread case, it isn't yet(?). Maybe motivation to fix the ObjectWaiter constructor with this patch? For the most part it is. But as Coleen points out, the vthread case might not be, and I'm not willing to risk it. >> src/hotspot/share/runtime/objectMonitor.cpp line 2018: >> >>> 2016: // that in prepend-mode we invert the order of the waiters. Let's say that the >>> 2017: // waitset is "ABCD" and the entry_list is "XYZ". After a notifyAll() in prepend >>> 2018: // mode the waitset will be empty and the entry_list will be "DCBAXYZ". >> >> We don't support different ordering modes any more so we always "prepend" such that waiters are added to the entry_list in the reverse order of waiting. So given waitList -> A -> B -> C -> D, and _entry_list -> x -> y -> z we will get _entry_list -> D -> C -> B -> A -> X -> Y -> Z > > One of the benefits of this work is to read, understand and clean up misleading and out of date comments in this code. Rewrote the comment. 
Let the waitset remain as a string "ABCD" because it would be too messy to try to depict it as a circular doubly linked list. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974266558 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974267473 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974270597 From fbredberg at openjdk.org Thu Feb 27 20:13:01 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:13:01 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 06:19:38 GMT, David Holmes wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 724: > >> 722: for (;;) { >> 723: ObjectWaiter* front = Atomic::load(&_entry_list); >> 724: > > No need for blank line. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 731: > >> 729: >> 730: // Interference - the CAS failed because _entry_list changed. Just retry. >> 731: // As an optional optimization we retry the lock. > > Suggestion: > > // Interference - the CAS failed because _entry_list changed. Before > // retrying the CAS retry taking the lock as it may now be free. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 812: > >> 810: guarantee(_entry_list == nullptr, >> 811: "must be no entering threads: entry_list=" INTPTR_FORMAT, >> 812: p2i(_entry_list)); > > Mustn't re-read _entry_list in the p2i as it may have changed from the value that is causing the guarantee to fail. The old guarantees were buggy in this regard - a temp is needed. 
Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1299: > >> 1297: assert(_entry_list_tail == nullptr || _entry_list_tail == currentNode, "invariant"); >> 1298: >> 1299: ObjectWaiter* v = Atomic::load(&_entry_list); > > Nit: use `w` to be consistent with similar code. The original used `w` for EntryList and `v` for cxq IIRC. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974268658 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974268941 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974267878 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974269555 From fbredberg at openjdk.org Thu Feb 27 20:19:01 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:19:01 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 14:15:15 GMT, Coleen Phillimore wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 735: > >> 733: assert(!has_successor(current), "invariant"); >> 734: assert(has_owner(current), "invariant"); >> 735: return true; > > I wonder for a future RFE we can move these asserts into TryLock. Good idea! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974277231 From fbredberg at openjdk.org Thu Feb 27 20:19:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:19:02 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 17:12:40 GMT, Coleen Phillimore wrote: >> src/hotspot/share/runtime/objectMonitor.hpp line 46: >> >>> 44: class ObjectWaiter : public CHeapObj { >>> 45: public: >>> 46: enum TStates : uint8_t { TS_UNDEF, TS_READY, TS_RUN, TS_WAIT, TS_ENTER }; >> >> TS_READY looks unused. > > Edit: this could be a trivial further PR. And so does `TS_UNDEF`, but the enum value for `TS_UNDEF` will be zero and maybe there is some hidden "check for uninitialized `TStates` code" somewhere that stops working... A grep also finds: `src/hotspot/share/prims/jvmtiRawMonitor.hpp: enum TStates { TS_READY, TS_RUN, TS_WAIT, TS_ENTER }; ` So, since this is not really in the core part of this PR, I'd like to postpone that change to a later cleanup RFE. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974278590 From fbredberg at openjdk.org Thu Feb 27 20:40:56 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:40:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. 
>> >> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. 
>> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, but it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2689061860 From fbredberg at openjdk.org Thu Feb 27 20:53:56 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:53:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. 
Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. @pchilano Since I have removed the `cxq` list @dholmes-ora suggested that I should also rename `_vthread_cxq_head`. Thereby removing the term "cxq" altogether. I chose to rename `_vthread_cxq_head` to `_vthread_list_head`. Hope that is okay. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2689083393 From fbredberg at openjdk.org Thu Feb 27 20:59:55 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:59:55 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: <1S7kUz3GfEDitlf6dU4nF5Tl1X7UNBhMDdWCPE9Apos=.a1e7abc2-065d-4fe3-95b2-d0d5ca884dac@github.com> On Mon, 10 Feb 2025 12:51:43 GMT, Fredrik Bredberg wrote: >> src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 332: >> >>> 330: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ >>> 331: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ >>> 332: volatile_nonstatic_field(ObjectMonitor, _EntryListTail, ObjectWaiter*) \ >> >> You may need to coordinate with @mur47x111 to see what graal does with this field. I suspect the graal code also checks both cxq and EntryList in the unlock fast path and now only needs to check _EntryList. In which case we don't need to export EntryListTail. > > Thanks for the heads up @coleenp. I was planning on contacting the Graal team when this PR gets closer to getting integrated. I'll delete the `_EntryListTail` export, and make sure to ask for a review from @mur47x111 when that time comes. 
They seem to have everything under control: [[JDK-8349711] Adapt JDK-8343840: Rewrite the ObjectMonitor lists](https://github.com/oracle/graal/pull/10757) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974327790 From fyang at openjdk.org Fri Feb 28 05:23:54 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 28 Feb 2025 05:23:54 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> References: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> Message-ID: On Thu, 27 Feb 2025 20:38:32 GMT, Fredrik Bredberg wrote: > I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, But it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. FYI: hs:tier1 - hs:tier3 test good on linux-riscv64 platform. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2689751810 From duke at openjdk.org Fri Feb 28 06:22:09 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 28 Feb 2025 06:22:09 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: - Merged master. - Added more comments, mainly as suggested by Andrew Dinn - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi - Accepting suggested change from Andrew Dinn - Added comments suggested by Andrew Dinn - Fixed copyright years - renaming a couple of functions - Adding comments + some code reorganization - removed debugging code - merging master - ... 
and 3 more: https://git.openjdk.org/jdk/compare/ab4b0ef9...d82dfb2f ------------- Changes: https://git.openjdk.org/jdk/pull/23300/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=07 Stats: 2611 lines in 22 files changed: 2030 ins; 92 del; 489 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From dholmes at openjdk.org Fri Feb 28 07:02:55 2025 From: dholmes at openjdk.org (David Holmes) Date: Fri, 28 Feb 2025 07:02:55 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. 
The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. Okay that's good enough for me. :) Thanks ------------- Marked as reviewed by dholmes (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2649910490 From amitkumar at openjdk.org Fri Feb 28 07:02:56 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 28 Feb 2025 07:02:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> Message-ID: On Fri, 28 Feb 2025 05:21:34 GMT, Fei Yang wrote: > I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, But it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. Tier1 test passed on s390x. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2689887509 From duke at openjdk.org Fri Feb 28 09:46:32 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 28 Feb 2025 09:46:32 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v2] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains three commits: - Merged master - removing trailing spaces - kyber aarch64 intrinsics ------------- Changes: https://git.openjdk.org/jdk/pull/23663/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=01 Stats: 2885 lines in 20 files changed: 2774 ins; 84 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From duke at openjdk.org Fri Feb 28 10:15:09 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 28 Feb 2025 10:15:09 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v3] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: A little cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23663/files - new: https://git.openjdk.org/jdk/pull/23663/files/ff0f8430..4adc5cf2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=01-02 Stats: 24 lines in 3 files changed: 0 ins; 23 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From tschatzl at openjdk.org Fri Feb 28 10:35:03 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 10:35:03 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> References: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> 
Message-ID: On Thu, 27 Feb 2025 18:24:15 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * remove unnecessarily added logging > > src/hotspot/share/gc/g1/g1BarrierSet.hpp line 54: > >> 52: // them, keeping the write barrier simple. >> 53: // >> 54: // The refinement threads mark cards in the the current collection set specially on the > > "the the" typo. I fixed one more occurrence in files changed in this CR. There are about 10 more of these duplications in our code; I will fix them separately. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1975186407 From mdoerr at openjdk.org Fri Feb 28 10:50:00 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 28 Feb 2025 10:50:00 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> Message-ID: On Fri, 28 Feb 2025 07:00:40 GMT, Amit Kumar wrote: > I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, but it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. The PPC64 code looks correct and some quick tests have passed. I'll run larger test suites over the weekend.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2690327204 From tschatzl at openjdk.org Fri Feb 28 11:25:53 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 11:25:53 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> References: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> Message-ID: <9tS5E1tteGutSNX7rZh5WYLdZoF7Vgl_4_pjuAdT4WU=.c8c73c45-7abb-48a9-b623-769d3c1679ca@github.com> On Thu, 27 Feb 2025 12:07:29 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * remove unnecessarily added logging > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349: > >> 347: >> 348: bool do_heap_region(G1HeapRegion* r) override { >> 349: if (!r->is_free()) { > > I am a bit lost on this closure; the intention seems to set unclaimed to all non-free regions, why can't this be done in one go, instead of first setting all regions to claimed (`reset_all_claims_to_claimed`), then set non-free ones unclaimed? `do_heap_region()` only visits committed regions in this case. I wanted to avoid the additional check in the iteration code. If you still think it is more clear to filter those out later, please tell me. I'll add a comment for now. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1975250646 From tschatzl at openjdk.org Fri Feb 28 12:14:01 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 12:14:01 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> References: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> Message-ID: <87L5pcyGAgyDsXTwlSdAFLyIAOcUl1ZdYXK-nwzLrUQ=.c3db7522-b3e6-46e0-b268-e457c3d2bdc2@github.com> On Thu, 27 Feb 2025 18:31:16 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * remove unnecessarily added logging > > src/hotspot/share/gc/g1/g1RemSet.cpp line 1252: > >> 1250: G1ConcurrentRefineWorkState::snapshot_heap_into(&constructed); >> 1251: claim = &constructed; >> 1252: } > > It's not super obvious to me why the "has_sweep_claims" checking needs to be on this level. Can `G1ConcurrentRefineWorkState` return a valid `G1CardTableClaimTable*` directly? I agree. I remember having similar thoughts as well, but then did not do anything about this. Will fix. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1975311607 From tschatzl at openjdk.org Fri Feb 28 13:43:24 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 13:43:24 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v3] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. 
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1's post-write barrier will be reduced to much more closely resemble Parallel GC's, as described in the JEP. The reason is that G1 lags behind Parallel/Serial GC in throughput due to its larger barrier. > > The main reason for the current barrier is how G1 implements concurrent refinement: > * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads. > * Finally, there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudocode: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for Parallel and Serial GC. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
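[For illustration, the filtering steps of the pseudocode above can be written as a small, self-contained C sketch. The 512-byte card size matches HotSpot's default; the 1 MiB region size, the card marker values, the table size, and all function names are illustrative assumptions, and the dirty card queue enqueue at the end is elided:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define CARD_SHIFT   9    /* 512-byte cards (HotSpot default) */
#define REGION_SHIFT 20   /* 1 MiB heap regions, assumed for illustration */
#define NUM_CARDS    1024

enum { CARD_DIRTY = 0, CARD_YOUNG = 2, CARD_CLEAN = 0xff }; /* illustrative */

static uint8_t card_table[NUM_CARDS];

/* Sketch of the current G1 post-write barrier for the assignment x.a = y. */
static void g1_post_write_barrier(uintptr_t field_addr, uintptr_t new_val) {
    /* Filtering */
    if ((field_addr >> REGION_SHIFT) == (new_val >> REGION_SHIFT))
        return;                                   /* same region check */
    if (new_val == 0)
        return;                                   /* null value check */
    uint8_t *card = &card_table[(field_addr >> CARD_SHIFT) % NUM_CARDS];
    if (*card == CARD_YOUNG)
        return;                                   /* write to young gen check */
    atomic_thread_fence(memory_order_seq_cst);    /* StoreLoad; synchronize */
    if (*card == CARD_DIRTY)
        return;                                   /* card already dirty */
    *card = CARD_DIRTY;
    /* Card tracking: enqueue card address into the thread-local dcq (elided) */
}
```

Even with the enqueue elided, the sketch makes the cost visible: three conditional branches, a full fence, and a fourth branch precede the single store that Parallel/Serial GC would execute on its own.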
> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: - * ayang review 1 (ctd) * split up sweep-rt state into "start" (to be called once) and "step" (to be called repeatedly) phases * move building the snapshot out of g1remset - * ayang review 1 * use uint for number of reserved regions consistently * rename *sweep_state to *sweep_table * improved comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/9ef9c5f4..7d361fc1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=01-02 Stats: 108 lines in 8 files changed: 40 ins; 24 del; 44 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Fri Feb 28 17:52:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 17:52:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v4] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1's post-write barrier will be reduced to much more closely resemble Parallel GC's, as described in the JEP. The reason is that G1 lags behind Parallel/Serial GC in throughput due to its larger barrier. > > The main reason for the current barrier is how G1 implements concurrent refinement: > * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads. > * Finally, there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudocode: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for Parallel and Serial GC. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * fix assert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/7d361fc1..d87935a0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739
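[The coarse-grained scheme described in the quoted text (which is truncated in the archive) can be illustrated with a minimal C sketch: mutators dirty cards through a pointer to the current "primary" table, and a refinement pass atomically swaps in the other table and then sweeps the retired one without any per-card synchronization with mutators. All names, the card values, and the assumption of a single refinement control thread performing the swap are illustrative, not the actual HotSpot implementation:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CARDS 1024
enum { CARD_CLEAN = 0, CARD_DIRTY = 1 };   /* illustrative values */

static uint8_t table_a[NUM_CARDS], table_b[NUM_CARDS];

/* Mutators always dirty cards through this pointer: no StoreLoad fence,
 * no per-card synchronization, no dirty card queues. */
static _Atomic(uint8_t *) primary_table = table_a;

static void mutator_dirty_card(size_t card_index) {
    uint8_t *t = atomic_load_explicit(&primary_table, memory_order_acquire);
    t[card_index] = CARD_DIRTY;
}

/* Called by a single refinement control thread (assumed): publish the other
 * table as primary, then sweep the retired table at leisure.
 * Returns the number of cards refined. */
static size_t refinement_sweep(void) {
    uint8_t *old = atomic_load_explicit(&primary_table, memory_order_relaxed);
    uint8_t *fresh = (old == table_a) ? table_b : table_a;
    atomic_store_explicit(&primary_table, fresh, memory_order_release);
    size_t refined = 0;
    for (size_t i = 0; i < NUM_CARDS; i++) {
        if (old[i] == CARD_DIRTY) {
            old[i] = CARD_CLEAN;   /* "re-refine" the card, then clean it */
            refined++;
        }
    }
    return refined;
}
```

The point of the sketch is that the synchronization cost is paid once per table swap rather than once per dirtied card, which is what lets the mutator-side barrier shrink toward Parallel GC's single store.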